Hypothesis

Postnatal environmental exposures, particularly those found in household products and dietary intake, along with specific serum metabolomic profiles, are significantly associated with the BMI Z-score of children aged 6-11 years. Higher serum concentrations of certain metabolites, reflecting exposure to specific chemical classes or metals, will correlate with BMI Z-score after controlling for age and other relevant covariates. Some metabolites associated with chemical exposures and dietary patterns may serve as biomarkers of the risk of developing obesity.

Background

Research indicates that postnatal exposure to endocrine-disrupting chemicals (EDCs) such as phthalates, bisphenol A (BPA), and polychlorinated biphenyls (PCBs) can significantly influence body weight and metabolic health (Junge et al., 2018). These chemicals, commonly found in household products and absorbed through dietary intake, interfere with hormonal signaling that regulates metabolism. This hormonal interference can lead to an increased body mass index (BMI) in children, suggesting a potential pathway through which exposure to these chemicals contributes to the development of obesity.

A longitudinal study of Japanese children examined the impact of postnatal exposure (first two years of life) to p,p’-dichlorodiphenyltrichloroethane (p,p’-DDT) and p,p’-dichlorodiphenyldichloroethylene (p,p’-DDE) through breastfeeding (Plouffe et al., 2020). Higher levels of these chemicals in breast milk were associated with increased BMI at 42 months of age. DDT and DDE may interfere with hormonal pathways related to growth and development, mimicking or disrupting hormones that regulate metabolism and fat accumulation. This study highlights the importance of understanding how persistent organic pollutants can affect early childhood growth and development.

The study by Harley et al. (2013) investigated the association between prenatal and postnatal bisphenol A (BPA) exposure and body composition metrics in 9-year-old children from the CHAMACOS cohort. Higher prenatal BPA exposure was linked to lower BMI and body fat percentage in girls but not boys, suggesting sex-specific effects. Conversely, BPA levels measured at age 9 were positively associated with adiposity in both sexes, highlighting how the timing of exposure shapes its impact on childhood development.

The 2022 study by Uldbjerg et al. explored the effects of combined exposure to multiple EDCs, suggesting that mixtures of these chemicals can have additive or synergistic effects on BMI and obesity risk. Humans are typically exposed to mixtures of chemicals rather than individual EDCs, making it crucial to understand how these mixtures interact. The research highlighted that interactions between different EDCs can be additive (the effects simply add up) or synergistic (the combined effect is greater than the sum of the separate effects), and that these interactions can substantially amplify the risk of obesity and metabolic disorders in children. Dose-response analyses indicated that even low-level exposure to multiple EDCs could produce significant health impacts through their combined effects.
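The additive-versus-synergistic distinction can be made concrete as an interaction term in a regression model. The following is a minimal sketch on simulated data only (the exposures and effect sizes here are invented for illustration, not HELIX values):

```r
# Sketch: additive vs synergistic effects of two exposures, simulated data
set.seed(1)
n  <- 300
e1 <- rnorm(n)                      # exposure 1 (e.g. one EDC, standardized)
e2 <- rnorm(n)                      # exposure 2
y  <- 0.2 * e1 + 0.2 * e2 + 0.3 * e1 * e2 + rnorm(n)  # built-in synergy

additive_fit <- lm(y ~ e1 + e2)     # effects simply add up
synergy_fit  <- lm(y ~ e1 * e2)     # e1:e2 term allows a combined effect
anova(additive_fit, synergy_fit)    # tests whether the interaction is needed
```

A significant e1:e2 term in the comparison indicates the combined effect exceeds the sum of the separate effects.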

These studies collectively illustrate the critical role of environmental EDCs in shaping metabolic health outcomes in children, highlighting the necessity for ongoing research and policy intervention to mitigate these risks.

Data Description

This study utilizes data from the HELIX subcohort of 1301 mother-child pairs with children aged 6-11 years for whom complete exposure and outcome data were available. Exposure data include detailed postnatal dietary records and concentrations of chemicals such as BPA and PCBs measured in child blood samples. The dataset contains both categorical and numerical variables, spanning demographic details and biochemical measurements, and allows robust statistical analysis of associations between EDC exposure and BMI Z-scores while accounting for confounders such as age, sex, and socioeconomic status. There are no missing data, so no imputation is needed. Child BMI Z-scores were calculated from WHO growth standards.
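The planned association analysis can be sketched as a covariate-adjusted linear model. The example below runs on simulated data only; the variable names mirror the HELIX codebook (hs_zbmi_who, hs_bpa_cadj_Log2, hs_child_age_None, e3_sex_None), but the values and the effect size are made up:

```r
# Sketch of the adjusted exposure-outcome model, fit on simulated data
set.seed(42)
n <- 200
sim <- data.frame(
  hs_bpa_cadj_Log2  = rnorm(n, mean = 1),                  # log2 BPA (simulated)
  hs_child_age_None = runif(n, 6, 11),                     # age in years
  e3_sex_None       = factor(sample(c("female", "male"), n, replace = TRUE))
)
sim$hs_zbmi_who <- 0.1 * sim$hs_bpa_cadj_Log2 + rnorm(n)   # simulated zBMI

# BMI z-score regressed on one log2-scaled exposure, adjusted for age and sex
fit <- lm(hs_zbmi_who ~ hs_bpa_cadj_Log2 + hs_child_age_None + e3_sex_None,
          data = sim)
coef(summary(fit))["hs_bpa_cadj_Log2", ]  # slope = change in zBMI per doubling
```

Because the exposure is on the log2 scale, the coefficient is interpreted as the change in BMI Z-score per doubling of the exposure concentration.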

# libraries used throughout this report
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(kableExtra)
library(summarytools)
library(corrplot)

load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/HELIX.RData")

# chemical and lifestyle exposures measured postnatally (excluding allergens)
filtered_chem_diet <- codebook %>%
  filter(domain %in% c("Chemicals", "Lifestyles") & period == "Postnatal" & subfamily != "Allergens")

# specific covariates
filtered_covariates <- codebook %>%
  filter(domain == "Covariates" & 
         variable_name %in% c("ID", "e3_sex_None", "e3_yearbir_None", "hs_child_age_None"))

#specific phenotype variables
filtered_phenotype <- codebook %>%
  filter(domain == "Phenotype" & 
         variable_name %in% c("hs_zbmi_who"))

# combining all necessary variables together
combined_codebook <- bind_rows(filtered_chem_diet, filtered_covariates, filtered_phenotype)
kable(combined_codebook, align = "c", format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
variable_name domain family subfamily period location period_postnatal description var_type transformation labels labelsshort
h_bfdur_Ter h_bfdur_Ter Lifestyles Lifestyle Diet Postnatal NA NA Breastfeeding duration (weeks) factor Tertiles Breastfeeding Breastfeeding
hs_bakery_prod_Ter hs_bakery_prod_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: bakery products (hs_cookies + hs_pastries) factor Tertiles Bakery prod BakeProd
hs_beverages_Ter hs_beverages_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: beverages (hs_dietsoda+hs_soda) factor Tertiles Soda Soda
hs_break_cer_Ter hs_break_cer_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: breakfast cereal (hs_sugarcer+hs_othcer) factor Tertiles BF cereals BFcereals
hs_caff_drink_Ter hs_caff_drink_Ter Lifestyles Lifestyle Diet Postnatal NA NA Drinks a caffeinated or 'energy' drink (eg coca-cola, diet-coke, redbull) factor Tertiles Caffeine Caffeine
hs_dairy_Ter hs_dairy_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: dairy (hs_cheese + hs_milk + hs_yogurt+ hs_probiotic+ hs_desert) factor Tertiles Dairy Dairy
hs_fastfood_Ter hs_fastfood_Ter Lifestyles Lifestyle Diet Postnatal NA NA Visits a fast food restaurant/take away factor Tertiles Fastfood Fastfood
hs_KIDMED_None hs_KIDMED_None Lifestyles Lifestyle Diet Postnatal NA NA Sum of KIDMED indices, without index9 numeric None KIDMED KIDMED
hs_mvpa_prd_alt_None hs_mvpa_prd_alt_None Lifestyles Lifestyle Physical activity Postnatal NA NA Clean & Over-reporting of Moderate-to-Vigorous Physical Activity (min/day) numeric None PA PA
hs_org_food_Ter hs_org_food_Ter Lifestyles Lifestyle Diet Postnatal NA NA Eats organic food factor Tertiles Organicfood Organicfood
hs_proc_meat_Ter hs_proc_meat_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: processed meat (hs_coldmeat+hs_ham) factor Tertiles Processed meat ProcMeat
hs_readymade_Ter hs_readymade_Ter Lifestyles Lifestyle Diet Postnatal NA NA Eats a 'ready-made' supermarket meal factor Tertiles Ready made food ReadyFood
hs_sd_wk_None hs_sd_wk_None Lifestyles Lifestyle Physical activity Postnatal NA NA sedentary behaviour (min/day) numeric None Sedentary Sedentary
hs_total_bread_Ter hs_total_bread_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: bread (hs_darkbread+hs_whbread) factor Tertiles Bread Bread
hs_total_cereal_Ter hs_total_cereal_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: cereal (hs_darkbread + hs_whbread + hs_rice_pasta + hs_sugarcer + hs_othcer + hs_rusks) factor Tertiles Cereals Cereals
hs_total_fish_Ter hs_total_fish_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: fish and seafood (hs_canfish+hs_oilyfish+hs_whfish+hs_seafood) factor Tertiles Fish Fish
hs_total_fruits_Ter hs_total_fruits_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: fruits (hs_canfruit+hs_dryfruit+hs_freshjuice+hs_fruits) factor Tertiles Fruits Fruits
hs_total_lipids_Ter hs_total_lipids_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: Added fat factor Tertiles Diet fat Diet fat
hs_total_meat_Ter hs_total_meat_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: meat (hs_coldmeat+hs_ham+hs_poultry+hs_redmeat) factor Tertiles Meat Meat
hs_total_potatoes_Ter hs_total_potatoes_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: potatoes (hs_frenchfries+hs_potatoes) factor Tertiles Potatoes Potatoes
hs_total_sweets_Ter hs_total_sweets_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: sweets (hs_choco + hs_sweets + hs_sugar) factor Tertiles Sweets Sweets
hs_total_veg_Ter hs_total_veg_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: vegetables (hs_cookveg+hs_rawveg) factor Tertiles Vegetables Vegetables
hs_total_yog_Ter hs_total_yog_Ter Lifestyles Lifestyle Diet Postnatal NA NA Food group: yogurt (hs_yogurt+hs_probiotic) factor Tertiles Yogurt Yogurt
hs_dif_hours_total_None hs_dif_hours_total_None Lifestyles Lifestyle Sleep Postnatal NA NA Total hours of sleep (mean weekdays and night) numeric None Sleep Sleep
hs_as_c_Log2 hs_as_c_Log2 Chemicals Metals As Postnatal NA NA Arsenic (As) in child numeric Logarithm base 2 As As
hs_cd_c_Log2 hs_cd_c_Log2 Chemicals Metals Cd Postnatal NA NA Cadmium (Cd) in child numeric Logarithm base 2 Cd Cd
hs_co_c_Log2 hs_co_c_Log2 Chemicals Metals Co Postnatal NA NA Cobalt (Co) in child numeric Logarithm base 2 Co Co
hs_cs_c_Log2 hs_cs_c_Log2 Chemicals Metals Cs Postnatal NA NA Caesium (Cs) in child numeric Logarithm base 2 Cs Cs
hs_cu_c_Log2 hs_cu_c_Log2 Chemicals Metals Cu Postnatal NA NA Copper (Cu) in child numeric Logarithm base 2 Cu Cu
hs_hg_c_Log2 hs_hg_c_Log2 Chemicals Metals Hg Postnatal NA NA Mercury (Hg) in child numeric Logarithm base 2 Hg Hg
hs_mn_c_Log2 hs_mn_c_Log2 Chemicals Metals Mn Postnatal NA NA Manganese (Mn) in child numeric Logarithm base 2 Mn Mn
hs_mo_c_Log2 hs_mo_c_Log2 Chemicals Metals Mo Postnatal NA NA Molybdenum (Mo) in child numeric Logarithm base 2 Mo Mo
hs_pb_c_Log2 hs_pb_c_Log2 Chemicals Metals Pb Postnatal NA NA Lead (Pb) in child numeric Logarithm base 2 Pb Pb
hs_tl_cdich_None hs_tl_cdich_None Chemicals Metals Tl Postnatal NA NA Dichotomous variable of thallium (Tl) in child factor None Tl Tl
hs_dde_cadj_Log2 hs_dde_cadj_Log2 Chemicals Organochlorines DDE Postnatal NA NA Dichlorodiphenyldichloroethylene (DDE) in child adjusted for lipids numeric Logarithm base 2 DDE DDE
hs_ddt_cadj_Log2 hs_ddt_cadj_Log2 Chemicals Organochlorines DDT Postnatal NA NA Dichlorodiphenyltrichloroethane (DDT) in child adjusted for lipids numeric Logarithm base 2 DDT DDT
hs_hcb_cadj_Log2 hs_hcb_cadj_Log2 Chemicals Organochlorines HCB Postnatal NA NA Hexachlorobenzene (HCB) in child adjusted for lipids numeric Logarithm base 2 HCB HCB
hs_pcb118_cadj_Log2 hs_pcb118_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Polychlorinated biphenyl -118 (PCB-118) in child adjusted for lipids numeric Logarithm base 2 PCB 118 PCB118
hs_pcb138_cadj_Log2 hs_pcb138_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Polychlorinated biphenyl-138 (PCB-138) in child adjusted for lipids numeric Logarithm base 2 PCB 138 PCB138
hs_pcb153_cadj_Log2 hs_pcb153_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Polychlorinated biphenyl-153 (PCB-153) in child adjusted for lipids numeric Logarithm base 2 PCB 153 PCB153
hs_pcb170_cadj_Log2 hs_pcb170_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Polychlorinated biphenyl-170 (PCB-170) in child adjusted for lipids numeric Logarithm base 2 PCB 170 PCB170
hs_pcb180_cadj_Log2 hs_pcb180_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Polychlorinated biphenyl-180 (PCB-180) in child adjusted for lipids numeric Logarithm base 2 PCB 180 PCB180
hs_sumPCBs5_cadj_Log2 hs_sumPCBs5_cadj_Log2 Chemicals Organochlorines PCBs Postnatal NA NA Sum of PCBs in child adjusted for lipids (4 cohorts) numeric Logarithm base 2 PCBs SumPCB
hs_dep_cadj_Log2 hs_dep_cadj_Log2 Chemicals Organophosphate pesticides DEP Postnatal NA NA Diethyl phosphate (DEP) in child adjusted for creatinine numeric Logarithm base 2 DEP DEP
hs_detp_cadj_Log2 hs_detp_cadj_Log2 Chemicals Organophosphate pesticides DETP Postnatal NA NA Diethyl thiophosphate (DETP) in child adjusted for creatinine numeric Logarithm base 2 DETP DETP
hs_dmdtp_cdich_None hs_dmdtp_cdich_None Chemicals Organophosphate pesticides DMDTP Postnatal NA NA Dichotomous variable of dimethyl dithiophosphate (DMDTP) in child factor None DMDTP DMDTP
hs_dmp_cadj_Log2 hs_dmp_cadj_Log2 Chemicals Organophosphate pesticides DMP Postnatal NA NA Dimethyl phosphate (DMP) in child adjusted for creatinine numeric Logarithm base 2 DMP DMP
hs_dmtp_cadj_Log2 hs_dmtp_cadj_Log2 Chemicals Organophosphate pesticides DMTP Postnatal NA NA Dimethyl thiophosphate (DMTP) in child adjusted for creatinine numeric Logarithm base 2 DMDTP DMTP
hs_pbde153_cadj_Log2 hs_pbde153_cadj_Log2 Chemicals Polybrominated diphenyl ethers (PBDE) PBDE153 Postnatal NA NA Polybrominated diphenyl ether-153 (PBDE-153) in child adjusted for lipids numeric Logarithm base 2 PBDE 153 PBDE153
hs_pbde47_cadj_Log2 hs_pbde47_cadj_Log2 Chemicals Polybrominated diphenyl ethers (PBDE) PBDE47 Postnatal NA NA Polybrominated diphenyl ether-47 (PBDE-47) in child adjusted for lipids numeric Logarithm base 2 PBDE 47 PBDE47
hs_pfhxs_c_Log2 hs_pfhxs_c_Log2 Chemicals Per- and polyfluoroalkyl substances (PFAS) PFHXS Postnatal NA NA Perfluorohexane sulfonate (PFHXS) in child numeric Logarithm base 2 PFHXS PFHXS
hs_pfna_c_Log2 hs_pfna_c_Log2 Chemicals Per- and polyfluoroalkyl substances (PFAS) PFNA Postnatal NA NA Perfluorononanoate (PFNA) in child numeric Logarithm base 2 PFNA PFNA
hs_pfoa_c_Log2 hs_pfoa_c_Log2 Chemicals Per- and polyfluoroalkyl substances (PFAS) PFOA Postnatal NA NA Perfluorooctanoate (PFOA) in child numeric Logarithm base 2 PFOA PFOA
hs_pfos_c_Log2 hs_pfos_c_Log2 Chemicals Per- and polyfluoroalkyl substances (PFAS) PFOS Postnatal NA NA Perfluorooctane sulfonate (PFOS) in child numeric Logarithm base 2 PFOS PFOS
hs_pfunda_c_Log2 hs_pfunda_c_Log2 Chemicals Per- and polyfluoroalkyl substances (PFAS) PFUNDA Postnatal NA NA Perfluoroundecanoate (PFUNDA) in child numeric Logarithm base 2 PFUNDA PFUNDA
hs_bpa_cadj_Log2 hs_bpa_cadj_Log2 Chemicals Phenols BPA Postnatal NA NA Bisphenol A (BPA) in child adjusted for creatinine numeric Logarithm base 2 BPA BPA
hs_bupa_cadj_Log2 hs_bupa_cadj_Log2 Chemicals Phenols BUPA Postnatal NA NA N-Butyl paraben (BUPA) in child adjusted for creatinine numeric Logarithm base 2 BUPA BUPA
hs_etpa_cadj_Log2 hs_etpa_cadj_Log2 Chemicals Phenols ETPA Postnatal NA NA Ethyl paraben (ETPA) in child adjusted for creatinine numeric Logarithm base 2 ETPA ETPA
hs_mepa_cadj_Log2 hs_mepa_cadj_Log2 Chemicals Phenols MEPA Postnatal NA NA Methyl paraben (MEPA) in child adjusted for creatinine numeric Logarithm base 2 MEPA MEPA
hs_oxbe_cadj_Log2 hs_oxbe_cadj_Log2 Chemicals Phenols OXBE Postnatal NA NA Oxybenzone (OXBE) in child adjusted for creatinine numeric Logarithm base 2 OXBE OXBE
hs_prpa_cadj_Log2 hs_prpa_cadj_Log2 Chemicals Phenols PRPA Postnatal NA NA Propyl paraben (PRPA) in child adjusted for creatinine numeric Logarithm base 2 PRPA PRPA
hs_trcs_cadj_Log2 hs_trcs_cadj_Log2 Chemicals Phenols TRCS Postnatal NA NA Triclosan (TRCS) in child adjusted for creatinine numeric Logarithm base 2 TRCS TRCS
hs_mbzp_cadj_Log2 hs_mbzp_cadj_Log2 Chemicals Phthalates MBZP Postnatal NA NA Mono benzyl phthalate (MBzP) in child adjusted for creatinine numeric Logarithm base 2 MBZP MBZP
hs_mecpp_cadj_Log2 hs_mecpp_cadj_Log2 Chemicals Phthalates MECPP Postnatal NA NA Mono-2-ethyl 5-carboxypentyl phthalate (MECPP) in child adjusted for creatinine numeric Logarithm base 2 MECPP MECPP
hs_mehhp_cadj_Log2 hs_mehhp_cadj_Log2 Chemicals Phthalates MEHHP Postnatal NA NA Mono-2-ethyl-5-hydroxyhexyl phthalate (MEHHP) in child adjusted for creatinine numeric Logarithm base 2 MEHHP MEHHP
hs_mehp_cadj_Log2 hs_mehp_cadj_Log2 Chemicals Phthalates MEHP Postnatal NA NA Mono-2-ethylhexyl phthalate (MEHP) in child adjusted for creatinine numeric Logarithm base 2 MEHP MEHP
hs_meohp_cadj_Log2 hs_meohp_cadj_Log2 Chemicals Phthalates MEOHP Postnatal NA NA Mono-2-ethyl-5-oxohexyl phthalate (MEOHP) in child adjusted for creatinine numeric Logarithm base 2 MEOHP MEOHP
hs_mep_cadj_Log2 hs_mep_cadj_Log2 Chemicals Phthalates MEP Postnatal NA NA Monoethyl phthalate (MEP) in child adjusted for creatinine numeric Logarithm base 2 MEP MEP
hs_mibp_cadj_Log2 hs_mibp_cadj_Log2 Chemicals Phthalates MIBP Postnatal NA NA Mono-iso-butyl phthalate (MiBP) in child adjusted for creatinine numeric Logarithm base 2 MIBP MIBP
hs_mnbp_cadj_Log2 hs_mnbp_cadj_Log2 Chemicals Phthalates MNBP Postnatal NA NA Mono-n-butyl phthalate (MnBP) in child adjusted for creatinine numeric Logarithm base 2 MNBP MNBP
hs_ohminp_cadj_Log2 hs_ohminp_cadj_Log2 Chemicals Phthalates OHMiNP Postnatal NA NA Mono-4-methyl-7-hydroxyoctyl phthalate (OHMiNP) in child adjusted for creatinine numeric Logarithm base 2 OHMiNP OHMiNP
hs_oxominp_cadj_Log2 hs_oxominp_cadj_Log2 Chemicals Phthalates OXOMINP Postnatal NA NA Mono-4-methyl-7-oxooctyl phthalate (OXOMiNP) in child adjusted for creatinine numeric Logarithm base 2 OXOMINP OXOMINP
hs_sumDEHP_cadj_Log2 hs_sumDEHP_cadj_Log2 Chemicals Phthalates DEHP Postnatal NA NA Sum of DEHP metabolites (µg/g) in child adjusted for creatinine numeric Logarithm base 2 DEHP SumDEHP
FAS_cat_None FAS_cat_None Chemicals Social and economic capital Economic capital Postnatal NA NA Family affluence score factor None Family affluence FamAfl
hs_contactfam_3cat_num_None hs_contactfam_3cat_num_None Chemicals Social and economic capital Social capital Postnatal NA NA social capital: family friends factor None Social contact SocCont
hs_hm_pers_None hs_hm_pers_None Chemicals Social and economic capital Social capital Postnatal NA NA How many people live in your home? numeric None House crowding HouseCrow
hs_participation_3cat_None hs_participation_3cat_None Chemicals Social and economic capital Social capital Postnatal NA NA social capital: structural factor None Social participation SocPartic
hs_cotinine_cdich_None hs_cotinine_cdich_None Chemicals Tobacco Smoke Cotinine Postnatal NA NA Dichotomous variable of cotinine in child factor None Cotinine Cotinine
hs_globalexp2_None hs_globalexp2_None Chemicals Tobacco Smoke Tobacco Smoke Postnatal NA NA Global exposure of the child to ETS (2 categories) factor None ETS ETS
hs_smk_parents_None hs_smk_parents_None Chemicals Tobacco Smoke Tobacco Smoke Postnatal NA NA Tobacco Smoke status of parents (both) factor None Smoking_parents SmokPar
e3_sex_None e3_sex_None Covariates Covariates Child covariate Pregnancy NA NA Child sex (female / male) factor None Child sex Sex
e3_yearbir_None e3_yearbir_None Covariates Covariates Child covariate Pregnancy NA NA Year of birth (2003 to 2009) factor None Year of birth YearBirth
hs_child_age_None hs_child_age_None Covariates Covariates Child covariate Postnatal NA NA Child age at examination (years) numeric None Child age cAge
hs_zbmi_who hs_zbmi_who Phenotype Phenotype Outcome at 6-11 years old Postnatal NA NA Body mass index z-score at 6-11 years old - WHO reference - Standardized on sex and age numeric None Body mass index z-score zBMI

Data Summary for Exposures, Covariates, and Outcome

Data Summary Exposures: Lifestyles

# specific lifestyle exposures
lifestyle_exposures <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_break_cer_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_proc_meat_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter"
)

lifestyle_exposome <- dplyr::select(exposome, all_of(lifestyle_exposures))
summarytools::view(dfSummary(lifestyle_exposome, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
No Variable Stats / Values Freqs (% of Valid) Graph Missing
1 h_bfdur_Ter [factor]
1. (0,10.8]
2. (10.8,34.9]
3. (34.9,Inf]
506(38.9%)
270(20.8%)
525(40.4%)
0 (0.0%)
2 hs_bakery_prod_Ter [factor]
1. (0,2]
2. (2,6]
3. (6,Inf]
345(26.5%)
423(32.5%)
533(41.0%)
0 (0.0%)
3 hs_break_cer_Ter [factor]
1. (0,1.1]
2. (1.1,5.5]
3. (5.5,Inf]
291(22.4%)
521(40.0%)
489(37.6%)
0 (0.0%)
4 hs_dairy_Ter [factor]
1. (0,14.6]
2. (14.6,25.6]
3. (25.6,Inf]
359(27.6%)
465(35.7%)
477(36.7%)
0 (0.0%)
5 hs_fastfood_Ter [factor]
1. (0,0.132]
2. (0.132,0.5]
3. (0.5,Inf]
143(11.0%)
603(46.3%)
555(42.7%)
0 (0.0%)
6 hs_org_food_Ter [factor]
1. (0,0.132]
2. (0.132,1]
3. (1,Inf]
429(33.0%)
396(30.4%)
476(36.6%)
0 (0.0%)
7 hs_proc_meat_Ter [factor]
1. (0,1.5]
2. (1.5,4]
3. (4,Inf]
366(28.1%)
471(36.2%)
464(35.7%)
0 (0.0%)
8 hs_total_fish_Ter [factor]
1. (0,1.5]
2. (1.5,3]
3. (3,Inf]
389(29.9%)
454(34.9%)
458(35.2%)
0 (0.0%)
9 hs_total_fruits_Ter [factor]
1. (0,7]
2. (7,14.1]
3. (14.1,Inf]
413(31.7%)
407(31.3%)
481(37.0%)
0 (0.0%)
10 hs_total_lipids_Ter [factor]
1. (0,3]
2. (3,7]
3. (7,Inf]
397(30.5%)
403(31.0%)
501(38.5%)
0 (0.0%)
11 hs_total_sweets_Ter [factor]
1. (0,4.1]
2. (4.1,8.5]
3. (8.5,Inf]
344(26.4%)
516(39.7%)
441(33.9%)
0 (0.0%)
12 hs_total_veg_Ter [factor]
1. (0,6]
2. (6,8.5]
3. (8.5,Inf]
404(31.1%)
314(24.1%)
583(44.8%)
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.4.0)
2024-07-23

categorical_lifestyle <- lifestyle_exposome %>% 
  dplyr::select(where(is.factor))

categorical_lifestyle_long <- pivot_longer(
  categorical_lifestyle,
  cols = everything(),
  names_to = "variable",
  values_to = "value"
)

unique_categorical_vars <- unique(categorical_lifestyle_long$variable)
categorical_plots <- lapply(unique_categorical_vars, function(var) {
  data <- filter(categorical_lifestyle_long, variable == var)
  
  p <- ggplot(data, aes(x = value, fill = value)) +
    geom_bar(stat = "count") +
    labs(title = paste("Distribution of", var), x = var, y = "Count")
  
  print(p)
  return(p)
})

Breastfeeding Duration: Observations cluster in the lowest and highest duration tertiles (38.9% and 40.4%), with the middle tertile least common, suggesting that both short and long breastfeeding durations are common.

Bakery Products: Consumption leans toward the highest tertile (41.0%), with the remainder spread across the lower two, indicating varied consumption levels of bakery products among participants.

Breakfast Cereal: The middle and highest tertiles of cereal consumption are most common (40.0% and 37.6%), suggesting moderate-to-high cereal consumption is typical.

Dairy: Shows a fairly even distribution across all categories, indicating a uniform consumption pattern of dairy products.

Fast Food: Most participants fall into the middle category, indicating moderate consumption of fast food.

Organic Food: Participants are spread fairly evenly across tertiles, with the highest tertile slightly most common (36.6%).

Processed Meat: Consumption levels are fairly evenly distributed, indicating varied dietary habits regarding processed meats.

Bread: Distribution shows a significant leaning towards higher bread consumption.

Cereal: Even distribution across categories suggests varied cereal consumption habits.

Fish and Seafood: Even distribution across categories, indicating varied consumption of fish and seafood.

Fruits: High fruit consumption is the most common, with fewer participants in the lowest category.

Added Fats: The highest tertile of added-fat intake is most common (38.5%), with the lower two tertiles roughly equal.

Sweets: The middle tertile of sweets consumption is most common (39.7%), with a substantial share in the highest tertile as well, indicating widespread consumption of sugary foods.

Vegetables: Most participants consume a high amount of vegetables.
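Statements like those above can be checked directly from the tertile frequencies. Below is a sketch on a simulated factor; with the real data, substitute a column such as lifestyle_exposome$h_bfdur_Ter (the levels and probabilities here only mimic the breastfeeding table above):

```r
# Percentage of children per tertile, computed from a frequency table
set.seed(7)
tertile <- factor(sample(c("(0,10.8]", "(10.8,34.9]", "(34.9,Inf]"),
                         size = 1301, replace = TRUE,
                         prob = c(0.39, 0.21, 0.40)))
round(100 * prop.table(table(tertile)), 1)  # percent in each tertile
```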

Data Summary Exposures: Chemicals

# specific chemical exposures
chemical_exposures <- c(
  "hs_cd_c_Log2",
  "hs_co_c_Log2",
  "hs_cs_c_Log2",
  "hs_cu_c_Log2",
  "hs_hg_c_Log2",
  "hs_mo_c_Log2",
  "hs_pb_c_Log2",
  "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2",
  "hs_pcb170_cadj_Log2",
  "hs_dep_cadj_Log2",
  "hs_pbde153_cadj_Log2",
  "hs_pfhxs_c_Log2",
  "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2",
  "hs_prpa_cadj_Log2",
  "hs_mbzp_cadj_Log2",
  "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2"
)

chemical_exposome <- dplyr::select(exposome, all_of(chemical_exposures))
summarytools::view(dfSummary(chemical_exposome, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
No Variable Stats / Values Freqs (% of Valid) Graph Missing
1 hs_cd_c_Log2 [numeric]
Mean (sd) : -4 (1)
min ≤ med ≤ max:
-10.4 ≤ -3.8 ≤ 0.8
IQR (CV) : 1 (-0.3)
695 distinct values 0 (0.0%)
2 hs_co_c_Log2 [numeric]
Mean (sd) : -2.3 (0.6)
min ≤ med ≤ max:
-5.5 ≤ -2.4 ≤ 1.4
IQR (CV) : 0.7 (-0.3)
317 distinct values 0 (0.0%)
3 hs_cs_c_Log2 [numeric]
Mean (sd) : 0.4 (0.6)
min ≤ med ≤ max:
-1.5 ≤ 0.5 ≤ 3.1
IQR (CV) : 0.8 (1.3)
369 distinct values 0 (0.0%)
4 hs_cu_c_Log2 [numeric]
Mean (sd) : 9.8 (0.2)
min ≤ med ≤ max:
9.1 ≤ 9.8 ≤ 12.1
IQR (CV) : 0.3 (0)
345 distinct values 0 (0.0%)
5 hs_hg_c_Log2 [numeric]
Mean (sd) : -0.3 (1.7)
min ≤ med ≤ max:
-10.9 ≤ -0.2 ≤ 3.7
IQR (CV) : 2.1 (-5.6)
698 distinct values 0 (0.0%)
6 hs_mo_c_Log2 [numeric]
Mean (sd) : -0.3 (0.9)
min ≤ med ≤ max:
-9.2 ≤ -0.4 ≤ 5.1
IQR (CV) : 0.8 (-2.9)
593 distinct values 0 (0.0%)
7 hs_pb_c_Log2 [numeric]
Mean (sd) : 3.1 (0.6)
min ≤ med ≤ max:
1.1 ≤ 3.1 ≤ 7.7
IQR (CV) : 0.8 (0.2)
529 distinct values 0 (0.0%)
8 hs_dde_cadj_Log2 [numeric]
Mean (sd) : 4.7 (1.5)
min ≤ med ≤ max:
1.2 ≤ 4.5 ≤ 11.1
IQR (CV) : 1.9 (0.3)
1050 distinct values 0 (0.0%)
9 hs_pcb153_cadj_Log2 [numeric]
Mean (sd) : 3.6 (0.9)
min ≤ med ≤ max:
1.2 ≤ 3.5 ≤ 7.8
IQR (CV) : 1.4 (0.3)
1047 distinct values 0 (0.0%)
10 hs_pcb170_cadj_Log2 [numeric]
Mean (sd) : -0.3 (3)
min ≤ med ≤ max:
-16.8 ≤ 0.3 ≤ 4.8
IQR (CV) : 2.2 (-9.8)
1039 distinct values 0 (0.0%)
11 hs_dep_cadj_Log2 [numeric]
Mean (sd) : 0.2 (3.2)
min ≤ med ≤ max:
-12.6 ≤ 0.9 ≤ 9.4
IQR (CV) : 3.3 (20)
1045 distinct values 0 (0.0%)
12 hs_pbde153_cadj_Log2 [numeric]
Mean (sd) : -4.5 (3.8)
min ≤ med ≤ max:
-17.6 ≤ -2.6 ≤ 4
IQR (CV) : 6.7 (-0.8)
1036 distinct values 0 (0.0%)
13 hs_pfhxs_c_Log2 [numeric]
Mean (sd) : -1.6 (1.3)
min ≤ med ≤ max:
-8.9 ≤ -1.4 ≤ 4.8
IQR (CV) : 1.7 (-0.8)
1061 distinct values 0 (0.0%)
14 hs_pfoa_c_Log2 [numeric]
Mean (sd) : 0.6 (0.6)
min ≤ med ≤ max:
-2.2 ≤ 0.6 ≤ 2.7
IQR (CV) : 0.7 (0.9)
1061 distinct values 0 (0.0%)
15 hs_pfos_c_Log2 [numeric]
Mean (sd) : 1 (1.1)
min ≤ med ≤ max:
-10.4 ≤ 1 ≤ 5.1
IQR (CV) : 1.3 (1.1)
1050 distinct values 0 (0.0%)
16 hs_prpa_cadj_Log2 [numeric]
Mean (sd) : -1.6 (3.8)
min ≤ med ≤ max:
-12 ≤ -2.3 ≤ 10.8
IQR (CV) : 5.2 (-2.4)
1031 distinct values 0 (0.0%)
17 hs_mbzp_cadj_Log2 [numeric]
Mean (sd) : 2.4 (1.2)
min ≤ med ≤ max:
-0.6 ≤ 2.3 ≤ 7.2
IQR (CV) : 1.5 (0.5)
1046 distinct values 0 (0.0%)
18 hs_mibp_cadj_Log2 [numeric]
Mean (sd) : 5.5 (1.1)
min ≤ med ≤ max:
2.3 ≤ 5.4 ≤ 9.8
IQR (CV) : 1.5 (0.2)
1057 distinct values 0 (0.0%)
19 hs_mnbp_cadj_Log2 [numeric]
Mean (sd) : 4.7 (1)
min ≤ med ≤ max:
1.9 ≤ 4.6 ≤ 8.9
IQR (CV) : 1.3 (0.2)
1048 distinct values 0 (0.0%)


#separate numeric and categorical data
numeric_chemical <- chemical_exposome %>% 
  dplyr::select(where(is.numeric))

numeric_chemical_long <- pivot_longer(
  numeric_chemical,
  cols = everything(),
  names_to = "variable",
  values_to = "value"
)

unique_numerical_vars <- unique(numeric_chemical_long$variable)

num_plots <- lapply(unique_numerical_vars, function(var) {
  data <- filter(numeric_chemical_long, variable == var)
  p <- ggplot(data, aes(x = value)) +
    geom_histogram(bins = 30, fill = "blue") +
    labs(title = paste("Histogram of", var), x = "Value", y = "Count")
  print(p)
  return(p)
})

Cadmium (hs_cd_c_Log2): The histogram for cadmium exposure shows a relatively symmetric distribution centered around -4 on the log2 scale. Most values fall between approximately -5 and -3, with few outliers.

Cobalt (hs_co_c_Log2): The histogram of cobalt levels displays a roughly normal distribution centered around -2.3 on the log2 scale, with a slight skew.

Cesium (hs_cs_c_Log2): Centered near the median of 0.5 on the log2 scale with moderate spread; a small number of participants have substantially higher exposures (up to 3.1).

Copper (hs_cu_c_Log2): Shows a right-skewed distribution, suggesting that while most individuals have moderate exposure, a few experience significantly higher levels of copper.

Mercury (hs_hg_c_Log2): This distribution is also right-skewed, common for environmental pollutants, where a majority have lower exposure levels, and a minority have high exposure levels.

Molybdenum (hs_mo_c_Log2): Shows a sharp peak near the median of -0.4 on the log2 scale, indicating that most values fall within a narrow range (IQR 0.8), with a small number of exceptionally high or low exposures in the tails.

Lead (hs_pb_c_Log2): The distribution is slightly right-skewed, indicating higher exposure levels in a smaller group of the population compared to the majority.

DDE (hs_dde_cadj_Log2): Shows a pronounced right skew, typical for chemicals that accumulate in the environment and in human tissues, indicating higher exposure levels in a smaller subset of the population.

PCB 153 (hs_pcb153_cadj_Log2): Right-skewed, suggesting that exposure to this compound is higher among a smaller segment of the population; the distribution appears bimodal, with peaks around 2 and 4.

PCB 170 (hs_pcb170_cadj_Log2): This histogram shows a significant skew, indicating lower concentrations in most samples, with fewer samples showing higher concentrations. This pattern suggests that while most individuals have low exposure, a few may have considerably higher levels.

DEP (hs_dep_cadj_Log2): DEP exposure is widely spread (sd 3.2 on the log2 scale), centered near its median of 0.9 with a long left tail.

PBDE 153 (hs_pbde153_cadj_Log2): This histogram shows a bimodal distribution, consistent with the variable's large spread (sd 3.8, IQR 6.7).

PFHxS (hs_pfhxs_c_Log2): The distribution peaks near its median of -1.4 on the log2 scale, with a broad spread indicating variation in PFHxS levels.

PFOA (hs_pfoa_c_Log2): Roughly symmetric with a peak near 0.6, indicating relatively consistent PFOA levels.

PFOS (hs_pfos_c_Log2): Similar to PFOA, the distribution peaks near 1, indicating relatively consistent PFOS levels.

Propyl Paraben (hs_prpa_cadj_Log2): The distribution is very broad (IQR 5.2), peaking near the median of -2.3, indicating high variability in propyl paraben levels.

Monobenzyl Phthalate (hs_mbzp_cadj_Log2): This histogram shows a right-skewed distribution: most values cluster toward the lower end, around the median of 2.3, with a long tail toward higher values suggesting occasional higher exposures.

Monoisobutyl Phthalate (hs_mibp_cadj_Log2): The distribution is right-skewed, similar to MBZP, but with a smoother decline. This pattern also indicates that while most subjects have lower exposure levels, a few experience significantly higher exposures.

Mono-n-butyl Phthalate (hs_mnbp_cadj_Log2): Peaks around 4, indicating consistent exposure levels with some variation. Few outliers are present.
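The skewness judgments above are read off histograms, but they can also be quantified. Below is a minimal moment-based sketch on a simulated right-skewed variable; with the real data, apply the function to a column such as chemical_exposome$hs_dde_cadj_Log2:

```r
# Quantifying skewness with the standardized third moment
set.seed(5)
x <- rgamma(1000, shape = 2)                 # right-skewed example data
moment_skew <- function(v) mean((v - mean(v))^3) / sd(v)^3
moment_skew(x)                               # positive values = right skew
```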

numeric_chemical <- chemical_exposome %>%
  dplyr::select(where(is.numeric))
cor_matrix <- cor(numeric_chemical, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 90, tl.cex = 0.6)
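Beyond visual inspection of the correlation plot, strongly correlated exposure pairs can be listed programmatically. A self-contained sketch on a simulated matrix follows (the 0.8 cutoff is an arbitrary choice; with the real data, reuse cor_matrix in place of cor_mat):

```r
# List variable pairs whose absolute correlation exceeds a cutoff
set.seed(3)
m <- matrix(rnorm(300), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
m <- cbind(m, d = m[, "a"] + rnorm(100, sd = 0.1))  # 'd' nearly duplicates 'a'
cor_mat <- cor(m)

high <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high[, 1]],
           var2 = colnames(cor_mat)[high[, 2]],
           r    = round(cor_mat[high], 2))
```

Highly correlated exposures matter for modeling: including both members of such a pair in one regression can inflate standard errors.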

Data Summary Covariates

Covariates were selected for their relevance to the postnatal outcome. The exceptions are child sex and year of birth, which the codebook records as pregnancy-period variables. These were included because BMI Z-scores can differ by sex and vary with the child's age and year of birth.

# Specified covariates
specific_covariates <- c(
  "e3_sex_None", 
  "e3_yearbir_None",
  "hs_child_age_None"
)

covariate_data <- dplyr::select(covariates, all_of(specific_covariates))
summarytools::view(dfSummary(covariate_data, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
No  Variable                     Stats / Values                     Freqs (% of Valid)   Missing
1   e3_sex_None [factor]         1. female                          608 (46.7%)          0 (0.0%)
                                 2. male                            693 (53.3%)
2   e3_yearbir_None [factor]     1. 2003                            55 (4.2%)            0 (0.0%)
                                 2. 2004                            107 (8.2%)
                                 3. 2005                            241 (18.5%)
                                 4. 2006                            256 (19.7%)
                                 5. 2007                            250 (19.2%)
                                 6. 2008                            379 (29.1%)
                                 7. 2009                            13 (1.0%)
3   hs_child_age_None [numeric]  Mean (sd): 8 (1.6)                 879 distinct values  0 (0.0%)
                                 min ≤ med ≤ max: 5.4 ≤ 8 ≤ 12.1
                                 IQR (CV): 2.4 (0.2)

Generated by summarytools 1.0.1 (R version 4.4.0)
2024-07-23

#separate numeric and categorical data
numeric_covariates <- covariate_data %>% 
  dplyr::select(where(is.numeric))

numeric_covariates_long <- pivot_longer(
  numeric_covariates,
  cols = everything(),
  names_to = "variable",
  values_to = "value"
)

unique_numerical_vars <- unique(numeric_covariates_long$variable)

num_plots <- lapply(unique_numerical_vars, function(var) {
  data <- filter(numeric_covariates_long, variable == var)
  p <- ggplot(data, aes(x = value)) +
    geom_histogram(bins = 30, fill = "blue") +
    labs(title = paste("Histogram of", var), x = "Value", y = "Count")
  print(p)
  return(p)
})

Child’s Age (hs_child_age): This histogram is multimodal, reflecting several peaks across different ages. This could be indicative of the data collection points or particular age groups being studied.

categorical_covariates <- covariate_data %>% 
  dplyr::select(where(is.factor))

categorical_covariates_long <- pivot_longer(
  categorical_covariates,
  cols = everything(),
  names_to = "variable",
  values_to = "value"
)

unique_categorical_vars <- unique(categorical_covariates_long$variable)
categorical_plots <- lapply(unique_categorical_vars, function(var) {
  data <- filter(categorical_covariates_long, variable == var)
  
  p <- ggplot(data, aes(x = value, fill = value)) +
    geom_bar(stat = "count") +
    labs(title = paste("Distribution of", var), x = var, y = "Count")
  
  print(p)
  return(p)
})

Cohorts (h_cohort): The distribution shows the count of subjects across six different cohorts. All cohorts have a substantial number of subjects, with cohort 5 showing the highest participation.

Gender Distribution (e3_sex): The gender distribution is nearly balanced, with a slightly higher count for males (693) than females (608).

Year of Birth (e3_yearbir): This chart shows that the majority of subjects were born in the later years, peaking in 2008 (29.1%). Only 13 subjects (1.0%) were born in 2009, suggesting recruitment largely ended that year.

Data Summary Outcome: Phenotype

outcome_BMI <- phenotype %>% 
  dplyr::select(hs_zbmi_who)
summarytools::view(dfSummary(outcome_BMI, style = 'grid', plain.ascii = FALSE, valid.col = FALSE, headings = FALSE), method = "render")
No  Variable               Stats / Values                     Freqs (% of Valid)   Missing
1   hs_zbmi_who [numeric]  Mean (sd): 0.4 (1.2)               421 distinct values  0 (0.0%)
                           min ≤ med ≤ max: -3.6 ≤ 0.3 ≤ 4.7
                           IQR (CV): 1.5 (3)


Descriptive Tables

By Gender

combined_data <- cbind(covariate_data, lifestyle_exposome, chemical_exposome, outcome_BMI)

combined_data <- combined_data[, !duplicated(colnames(combined_data))]

# sex variable to a factor for stratification
combined_data$e3_sex_None <- as.factor(combined_data$e3_sex_None)
levels(combined_data$e3_sex_None) <- c("Female", "Male")  # levels are "female", "male" (alphabetical), so relabel in the same order

render_cont <- function(x) {
  with(stats.default(x), sprintf("%0.2f (%0.2f)", MEAN, SD))
}

render_cat <- function(x) {
  c("", sapply(stats.default(x), function(y) with(y, sprintf("%d (%0.1f %%)", FREQ, PCT))))
}

table1_formula <- ~ 
  hs_child_age_None + e3_yearbir_None + 
  hs_zbmi_who +
  h_bfdur_Ter + hs_bakery_prod_Ter + hs_break_cer_Ter + hs_dairy_Ter + hs_fastfood_Ter + hs_org_food_Ter +
  hs_proc_meat_Ter +
  hs_total_fish_Ter + hs_total_fruits_Ter + hs_total_lipids_Ter + hs_total_sweets_Ter + hs_total_veg_Ter +
  hs_cd_c_Log2 + hs_co_c_Log2 + hs_cs_c_Log2 + hs_cu_c_Log2 +
  hs_hg_c_Log2 + hs_mo_c_Log2 + hs_dde_cadj_Log2 + hs_pcb153_cadj_Log2 +
  hs_pcb170_cadj_Log2 + hs_dep_cadj_Log2 + hs_pbde153_cadj_Log2 +
  hs_pfhxs_c_Log2 + hs_pfoa_c_Log2 + hs_pfos_c_Log2 + hs_prpa_cadj_Log2 +
  hs_mbzp_cadj_Log2 + hs_mibp_cadj_Log2 + hs_mnbp_cadj_Log2 | e3_sex_None

table1(
  table1_formula,
  data = combined_data,
  render.continuous = render_cont,
  render.categorical = render_cat,
  overall = "Overall",
  topclass = "Rtable1-shade"
)
Female
(N=608)
Male
(N=693)
Overall
(N=1301)
hs_child_age_None 7.91 (1.58) 8.03 (1.64) 7.98 (1.61)
e3_yearbir_None
2003 25 (4.1 %) 30 (4.3 %) 55 (4.2 %)
2004 46 (7.6 %) 61 (8.8 %) 107 (8.2 %)
2005 121 (19.9 %) 120 (17.3 %) 241 (18.5 %)
2006 108 (17.8 %) 148 (21.4 %) 256 (19.7 %)
2007 128 (21.1 %) 122 (17.6 %) 250 (19.2 %)
2008 177 (29.1 %) 202 (29.1 %) 379 (29.1 %)
2009 3 (0.5 %) 10 (1.4 %) 13 (1.0 %)
hs_zbmi_who 0.35 (1.15) 0.45 (1.22) 0.40 (1.19)
h_bfdur_Ter
(0,10.8] 231 (38.0 %) 275 (39.7 %) 506 (38.9 %)
(10.8,34.9] 118 (19.4 %) 152 (21.9 %) 270 (20.8 %)
(34.9,Inf] 259 (42.6 %) 266 (38.4 %) 525 (40.4 %)
hs_bakery_prod_Ter
(0,2] 164 (27.0 %) 181 (26.1 %) 345 (26.5 %)
(2,6] 188 (30.9 %) 235 (33.9 %) 423 (32.5 %)
(6,Inf] 256 (42.1 %) 277 (40.0 %) 533 (41.0 %)
hs_break_cer_Ter
(0,1.1] 141 (23.2 %) 150 (21.6 %) 291 (22.4 %)
(1.1,5.5] 251 (41.3 %) 270 (39.0 %) 521 (40.0 %)
(5.5,Inf] 216 (35.5 %) 273 (39.4 %) 489 (37.6 %)
hs_dairy_Ter
(0,14.6] 175 (28.8 %) 184 (26.6 %) 359 (27.6 %)
(14.6,25.6] 229 (37.7 %) 236 (34.1 %) 465 (35.7 %)
(25.6,Inf] 204 (33.6 %) 273 (39.4 %) 477 (36.7 %)
hs_fastfood_Ter
(0,0.132] 75 (12.3 %) 68 (9.8 %) 143 (11.0 %)
(0.132,0.5] 273 (44.9 %) 330 (47.6 %) 603 (46.3 %)
(0.5,Inf] 260 (42.8 %) 295 (42.6 %) 555 (42.7 %)
hs_org_food_Ter
(0,0.132] 211 (34.7 %) 218 (31.5 %) 429 (33.0 %)
(0.132,1] 191 (31.4 %) 205 (29.6 %) 396 (30.4 %)
(1,Inf] 206 (33.9 %) 270 (39.0 %) 476 (36.6 %)
hs_proc_meat_Ter
(0,1.5] 175 (28.8 %) 191 (27.6 %) 366 (28.1 %)
(1.5,4] 227 (37.3 %) 244 (35.2 %) 471 (36.2 %)
(4,Inf] 206 (33.9 %) 258 (37.2 %) 464 (35.7 %)
hs_total_fish_Ter
(0,1.5] 183 (30.1 %) 206 (29.7 %) 389 (29.9 %)
(1.5,3] 224 (36.8 %) 230 (33.2 %) 454 (34.9 %)
(3,Inf] 201 (33.1 %) 257 (37.1 %) 458 (35.2 %)
hs_total_fruits_Ter
(0,7] 174 (28.6 %) 239 (34.5 %) 413 (31.7 %)
(7,14.1] 216 (35.5 %) 191 (27.6 %) 407 (31.3 %)
(14.1,Inf] 218 (35.9 %) 263 (38.0 %) 481 (37.0 %)
hs_total_lipids_Ter
(0,3] 193 (31.7 %) 204 (29.4 %) 397 (30.5 %)
(3,7] 171 (28.1 %) 232 (33.5 %) 403 (31.0 %)
(7,Inf] 244 (40.1 %) 257 (37.1 %) 501 (38.5 %)
hs_total_sweets_Ter
(0,4.1] 149 (24.5 %) 195 (28.1 %) 344 (26.4 %)
(4.1,8.5] 251 (41.3 %) 265 (38.2 %) 516 (39.7 %)
(8.5,Inf] 208 (34.2 %) 233 (33.6 %) 441 (33.9 %)
hs_total_veg_Ter
(0,6] 190 (31.2 %) 214 (30.9 %) 404 (31.1 %)
(6,8.5] 136 (22.4 %) 178 (25.7 %) 314 (24.1 %)
(8.5,Inf] 282 (46.4 %) 301 (43.4 %) 583 (44.8 %)
hs_cd_c_Log2 -3.99 (0.98) -3.95 (1.09) -3.97 (1.04)
hs_co_c_Log2 -2.37 (0.61) -2.32 (0.64) -2.34 (0.63)
hs_cs_c_Log2 0.44 (0.58) 0.44 (0.57) 0.44 (0.57)
hs_cu_c_Log2 9.81 (0.25) 9.84 (0.22) 9.83 (0.23)
hs_hg_c_Log2 -0.24 (1.59) -0.35 (1.75) -0.30 (1.68)
hs_mo_c_Log2 -0.32 (0.83) -0.31 (0.96) -0.32 (0.90)
hs_dde_cadj_Log2 4.63 (1.48) 4.70 (1.50) 4.67 (1.49)
hs_pcb153_cadj_Log2 3.47 (0.86) 3.63 (0.94) 3.56 (0.90)
hs_pcb170_cadj_Log2 -0.60 (3.22) -0.05 (2.77) -0.31 (3.00)
hs_dep_cadj_Log2 0.27 (3.16) 0.06 (3.25) 0.16 (3.21)
hs_pbde153_cadj_Log2 -4.66 (3.86) -4.40 (3.80) -4.53 (3.83)
hs_pfhxs_c_Log2 -1.62 (1.30) -1.53 (1.31) -1.57 (1.31)
hs_pfoa_c_Log2 0.60 (0.55) 0.62 (0.56) 0.61 (0.55)
hs_pfos_c_Log2 0.95 (1.15) 0.99 (1.08) 0.97 (1.11)
hs_prpa_cadj_Log2 -1.26 (3.96) -1.91 (3.68) -1.61 (3.82)
hs_mbzp_cadj_Log2 2.42 (1.23) 2.47 (1.22) 2.44 (1.22)
hs_mibp_cadj_Log2 5.54 (1.09) 5.39 (1.12) 5.46 (1.11)
hs_mnbp_cadj_Log2 4.77 (1.08) 4.60 (0.96) 4.68 (1.02)
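The small male/female difference in mean BMI Z-score visible in the table can be checked formally before modeling. A minimal sketch, assuming `combined_data` from the chunk above is in scope:

```r
# Exploratory check only: does mean BMI z-score differ by sex?
# The main analysis adjusts for covariates via penalized regression.
t.test(hs_zbmi_who ~ e3_sex_None, data = combined_data)

# Non-parametric alternative if normality is in doubt
wilcox.test(hs_zbmi_who ~ e3_sex_None, data = combined_data)
```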

By Year of Birth

combined_data$e3_yearbir_None <- as.factor(combined_data$e3_yearbir_None)

render_cont <- function(x) {
  with(stats.default(x), sprintf("%0.2f (%0.2f)", MEAN, SD))
}

render_cat <- function(x) {
  c("", sapply(stats.default(x), function(y) with(y, sprintf("%d (%0.1f %%)", FREQ, PCT))))
}

table1_formula_year <- ~ 
  hs_child_age_None + e3_sex_None + 
  hs_zbmi_who +
  h_bfdur_Ter + hs_bakery_prod_Ter + hs_break_cer_Ter + hs_dairy_Ter + hs_fastfood_Ter + hs_org_food_Ter +
  hs_proc_meat_Ter +
  hs_total_fish_Ter + hs_total_fruits_Ter + hs_total_lipids_Ter + hs_total_sweets_Ter + hs_total_veg_Ter +
  hs_cd_c_Log2 + hs_co_c_Log2 + hs_cs_c_Log2 + hs_cu_c_Log2 +
  hs_hg_c_Log2 + hs_mo_c_Log2 + hs_dde_cadj_Log2 + hs_pcb153_cadj_Log2 +
  hs_pcb170_cadj_Log2 + hs_dep_cadj_Log2 + hs_pbde153_cadj_Log2 +
  hs_pfhxs_c_Log2 + hs_pfoa_c_Log2 + hs_pfos_c_Log2 + hs_prpa_cadj_Log2 +
  hs_mbzp_cadj_Log2 + hs_mibp_cadj_Log2 + hs_mnbp_cadj_Log2 | e3_yearbir_None

table1(
  table1_formula_year,
  data = combined_data,
  render.continuous = render_cont,
  render.categorical = render_cat,
  overall = "Overall",
  topclass = "Rtable1-shade"
)
2003
(N=55)
2004
(N=107)
2005
(N=241)
2006
(N=256)
2007
(N=250)
2008
(N=379)
2009
(N=13)
Overall
(N=1301)
hs_child_age_None 11.32 (0.47) 10.81 (0.38) 9.19 (0.57) 8.38 (0.41) 6.91 (0.51) 6.41 (0.31) 6.18 (0.15) 7.98 (1.61)
e3_sex_None
Female 25 (45.5 %) 46 (43.0 %) 121 (50.2 %) 108 (42.2 %) 128 (51.2 %) 177 (46.7 %) 3 (23.1 %) 608 (46.7 %)
Male 30 (54.5 %) 61 (57.0 %) 120 (49.8 %) 148 (57.8 %) 122 (48.8 %) 202 (53.3 %) 10 (76.9 %) 693 (53.3 %)
hs_zbmi_who 0.32 (1.15) 0.12 (1.17) 0.45 (1.11) 0.34 (1.11) 0.41 (1.22) 0.49 (1.28) 0.79 (1.14) 0.40 (1.19)
h_bfdur_Ter
(0,10.8] 35 (63.6 %) 65 (60.7 %) 92 (38.2 %) 81 (31.6 %) 90 (36.0 %) 138 (36.4 %) 5 (38.5 %) 506 (38.9 %)
(10.8,34.9] 15 (27.3 %) 28 (26.2 %) 69 (28.6 %) 42 (16.4 %) 38 (15.2 %) 75 (19.8 %) 3 (23.1 %) 270 (20.8 %)
(34.9,Inf] 5 (9.1 %) 14 (13.1 %) 80 (33.2 %) 133 (52.0 %) 122 (48.8 %) 166 (43.8 %) 5 (38.5 %) 525 (40.4 %)
hs_bakery_prod_Ter
(0,2] 16 (29.1 %) 21 (19.6 %) 88 (36.5 %) 117 (45.7 %) 48 (19.2 %) 53 (14.0 %) 2 (15.4 %) 345 (26.5 %)
(2,6] 12 (21.8 %) 30 (28.0 %) 75 (31.1 %) 90 (35.2 %) 77 (30.8 %) 132 (34.8 %) 7 (53.8 %) 423 (32.5 %)
(6,Inf] 27 (49.1 %) 56 (52.3 %) 78 (32.4 %) 49 (19.1 %) 125 (50.0 %) 194 (51.2 %) 4 (30.8 %) 533 (41.0 %)
hs_break_cer_Ter
(0,1.1] 16 (29.1 %) 38 (35.5 %) 60 (24.9 %) 62 (24.2 %) 39 (15.6 %) 70 (18.5 %) 6 (46.2 %) 291 (22.4 %)
(1.1,5.5] 19 (34.5 %) 36 (33.6 %) 103 (42.7 %) 103 (40.2 %) 103 (41.2 %) 153 (40.4 %) 4 (30.8 %) 521 (40.0 %)
(5.5,Inf] 20 (36.4 %) 33 (30.8 %) 78 (32.4 %) 91 (35.5 %) 108 (43.2 %) 156 (41.2 %) 3 (23.1 %) 489 (37.6 %)
hs_dairy_Ter
(0,14.6] 15 (27.3 %) 21 (19.6 %) 67 (27.8 %) 58 (22.7 %) 73 (29.2 %) 119 (31.4 %) 6 (46.2 %) 359 (27.6 %)
(14.6,25.6] 12 (21.8 %) 27 (25.2 %) 90 (37.3 %) 101 (39.5 %) 80 (32.0 %) 149 (39.3 %) 6 (46.2 %) 465 (35.7 %)
(25.6,Inf] 28 (50.9 %) 59 (55.1 %) 84 (34.9 %) 97 (37.9 %) 97 (38.8 %) 111 (29.3 %) 1 (7.7 %) 477 (36.7 %)
hs_fastfood_Ter
(0,0.132] 6 (10.9 %) 14 (13.1 %) 20 (8.3 %) 21 (8.2 %) 22 (8.8 %) 57 (15.0 %) 3 (23.1 %) 143 (11.0 %)
(0.132,0.5] 38 (69.1 %) 45 (42.1 %) 140 (58.1 %) 153 (59.8 %) 94 (37.6 %) 129 (34.0 %) 4 (30.8 %) 603 (46.3 %)
(0.5,Inf] 11 (20.0 %) 48 (44.9 %) 81 (33.6 %) 82 (32.0 %) 134 (53.6 %) 193 (50.9 %) 6 (46.2 %) 555 (42.7 %)
hs_org_food_Ter
(0,0.132] 17 (30.9 %) 30 (28.0 %) 77 (32.0 %) 51 (19.9 %) 96 (38.4 %) 155 (40.9 %) 3 (23.1 %) 429 (33.0 %)
(0.132,1] 20 (36.4 %) 39 (36.4 %) 78 (32.4 %) 99 (38.7 %) 68 (27.2 %) 87 (23.0 %) 5 (38.5 %) 396 (30.4 %)
(1,Inf] 18 (32.7 %) 38 (35.5 %) 86 (35.7 %) 106 (41.4 %) 86 (34.4 %) 137 (36.1 %) 5 (38.5 %) 476 (36.6 %)
hs_proc_meat_Ter
(0,1.5] 6 (10.9 %) 26 (24.3 %) 38 (15.8 %) 37 (14.5 %) 91 (36.4 %) 162 (42.7 %) 6 (46.2 %) 366 (28.1 %)
(1.5,4] 36 (65.5 %) 41 (38.3 %) 80 (33.2 %) 92 (35.9 %) 83 (33.2 %) 134 (35.4 %) 5 (38.5 %) 471 (36.2 %)
(4,Inf] 13 (23.6 %) 40 (37.4 %) 123 (51.0 %) 127 (49.6 %) 76 (30.4 %) 83 (21.9 %) 2 (15.4 %) 464 (35.7 %)
hs_total_fish_Ter
(0,1.5] 11 (20.0 %) 20 (18.7 %) 32 (13.3 %) 36 (14.1 %) 106 (42.4 %) 180 (47.5 %) 4 (30.8 %) 389 (29.9 %)
(1.5,3] 31 (56.4 %) 56 (52.3 %) 80 (33.2 %) 70 (27.3 %) 82 (32.8 %) 128 (33.8 %) 7 (53.8 %) 454 (34.9 %)
(3,Inf] 13 (23.6 %) 31 (29.0 %) 129 (53.5 %) 150 (58.6 %) 62 (24.8 %) 71 (18.7 %) 2 (15.4 %) 458 (35.2 %)
hs_total_fruits_Ter
(0,7] 33 (60.0 %) 58 (54.2 %) 73 (30.3 %) 55 (21.5 %) 68 (27.2 %) 119 (31.4 %) 7 (53.8 %) 413 (31.7 %)
(7,14.1] 13 (23.6 %) 24 (22.4 %) 88 (36.5 %) 76 (29.7 %) 81 (32.4 %) 123 (32.5 %) 2 (15.4 %) 407 (31.3 %)
(14.1,Inf] 9 (16.4 %) 25 (23.4 %) 80 (33.2 %) 125 (48.8 %) 101 (40.4 %) 137 (36.1 %) 4 (30.8 %) 481 (37.0 %)
hs_total_lipids_Ter
(0,3] 9 (16.4 %) 18 (16.8 %) 97 (40.2 %) 82 (32.0 %) 67 (26.8 %) 122 (32.2 %) 2 (15.4 %) 397 (30.5 %)
(3,7] 26 (47.3 %) 45 (42.1 %) 71 (29.5 %) 59 (23.0 %) 80 (32.0 %) 116 (30.6 %) 6 (46.2 %) 403 (31.0 %)
(7,Inf] 20 (36.4 %) 44 (41.1 %) 73 (30.3 %) 115 (44.9 %) 103 (41.2 %) 141 (37.2 %) 5 (38.5 %) 501 (38.5 %)
hs_total_sweets_Ter
(0,4.1] 16 (29.1 %) 17 (15.9 %) 87 (36.1 %) 96 (37.5 %) 51 (20.4 %) 74 (19.5 %) 3 (23.1 %) 344 (26.4 %)
(4.1,8.5] 13 (23.6 %) 38 (35.5 %) 99 (41.1 %) 103 (40.2 %) 104 (41.6 %) 157 (41.4 %) 2 (15.4 %) 516 (39.7 %)
(8.5,Inf] 26 (47.3 %) 52 (48.6 %) 55 (22.8 %) 57 (22.3 %) 95 (38.0 %) 148 (39.1 %) 8 (61.5 %) 441 (33.9 %)
hs_total_veg_Ter
(0,6] 19 (34.5 %) 26 (24.3 %) 75 (31.1 %) 63 (24.6 %) 81 (32.4 %) 136 (35.9 %) 4 (30.8 %) 404 (31.1 %)
(6,8.5] 11 (20.0 %) 25 (23.4 %) 53 (22.0 %) 71 (27.7 %) 55 (22.0 %) 95 (25.1 %) 4 (30.8 %) 314 (24.1 %)
(8.5,Inf] 25 (45.5 %) 56 (52.3 %) 113 (46.9 %) 122 (47.7 %) 114 (45.6 %) 148 (39.1 %) 5 (38.5 %) 583 (44.8 %)
hs_cd_c_Log2 -4.32 (1.45) -3.93 (1.01) -3.90 (1.15) -3.91 (1.00) -3.90 (0.88) -4.05 (0.97) -4.09 (1.85) -3.97 (1.04)
hs_co_c_Log2 -2.35 (0.51) -2.38 (0.59) -2.54 (0.60) -2.45 (0.68) -2.27 (0.58) -2.20 (0.62) -2.11 (0.55) -2.34 (0.63)
hs_cs_c_Log2 1.04 (0.47) 1.03 (0.50) 0.71 (0.43) 0.65 (0.43) 0.18 (0.50) 0.07 (0.45) -0.04 (0.40) 0.44 (0.57)
hs_cu_c_Log2 9.83 (0.18) 9.89 (0.30) 9.79 (0.21) 9.76 (0.22) 9.82 (0.22) 9.88 (0.23) 9.86 (0.26) 9.83 (0.23)
hs_hg_c_Log2 0.56 (1.21) 0.73 (1.27) 0.41 (1.35) 0.12 (1.35) -0.78 (1.73) -1.11 (1.70) -0.75 (1.41) -0.30 (1.68)
hs_mo_c_Log2 -0.85 (1.80) -0.43 (0.64) -0.45 (0.83) -0.32 (0.81) -0.26 (0.83) -0.16 (0.88) -0.27 (0.94) -0.32 (0.90)
hs_dde_cadj_Log2 4.06 (1.34) 3.90 (1.09) 4.23 (1.24) 4.32 (1.02) 4.96 (1.66) 5.28 (1.62) 5.16 (1.23) 4.67 (1.49)
hs_pcb153_cadj_Log2 3.54 (0.73) 3.47 (0.72) 3.80 (0.84) 4.03 (0.79) 3.32 (0.95) 3.25 (0.87) 3.69 (0.96) 3.56 (0.90)
hs_pcb170_cadj_Log2 0.48 (1.19) 0.17 (2.28) 0.64 (2.35) 1.08 (1.77) -1.46 (3.58) -1.32 (3.29) -0.92 (3.65) -0.31 (3.00)
hs_dep_cadj_Log2 -0.62 (3.08) 0.10 (3.10) 0.13 (3.31) 0.19 (2.90) 0.65 (3.10) -0.02 (3.42) -0.25 (3.57) 0.16 (3.21)
hs_pbde153_cadj_Log2 -4.43 (3.35) -5.26 (3.63) -4.26 (3.77) -3.68 (3.60) -4.30 (3.95) -5.20 (3.93) -5.08 (3.90) -4.53 (3.83)
hs_pfhxs_c_Log2 -0.58 (0.66) -0.53 (0.85) -1.01 (0.99) -1.07 (0.88) -2.03 (1.40) -2.37 (1.22) -2.68 (0.71) -1.57 (1.31)
hs_pfoa_c_Log2 0.53 (0.44) 0.55 (0.53) 0.67 (0.49) 0.64 (0.51) 0.66 (0.63) 0.55 (0.58) 0.35 (0.50) 0.61 (0.55)
hs_pfos_c_Log2 1.67 (0.68) 1.55 (0.72) 1.18 (1.00) 1.10 (1.14) 0.80 (1.05) 0.62 (1.17) 0.40 (0.94) 0.97 (1.11)
hs_prpa_cadj_Log2 -3.21 (3.68) -2.67 (3.04) -0.92 (3.92) -1.65 (3.89) -1.57 (3.80) -1.58 (3.80) 0.87 (4.50) -1.61 (3.82)
hs_mbzp_cadj_Log2 2.69 (1.08) 2.84 (1.11) 2.43 (1.19) 2.32 (1.14) 2.36 (1.34) 2.45 (1.24) 2.21 (1.51) 2.44 (1.22)
hs_mibp_cadj_Log2 5.19 (1.08) 5.59 (1.07) 4.84 (0.97) 4.82 (0.98) 5.84 (1.00) 6.02 (0.93) 6.27 (0.97) 5.46 (1.11)
hs_mnbp_cadj_Log2 3.99 (0.86) 4.34 (0.88) 4.26 (0.90) 4.52 (0.90) 4.85 (0.95) 5.11 (1.05) 5.22 (0.77) 4.68 (1.02)

All Variables of Interest

This section examines the full set of diet and chemical exposure variables; penalized regression models will then identify which variables matter most for the analysis.

outcome_cov <- cbind(covariate_data, outcome_BMI)
outcome_cov <- outcome_cov[, !duplicated(colnames(outcome_cov))]
#the full chemicals list
chemicals_full <- c(
  "hs_as_c_Log2",
  "hs_cd_c_Log2",
  "hs_co_c_Log2",
  "hs_cs_c_Log2",
  "hs_cu_c_Log2",
  "hs_hg_c_Log2",
  "hs_mn_c_Log2",
  "hs_mo_c_Log2",
  "hs_pb_c_Log2",
  "hs_tl_cdich_None",
  "hs_dde_cadj_Log2",
  "hs_ddt_cadj_Log2",
  "hs_hcb_cadj_Log2",
  "hs_pcb118_cadj_Log2",
  "hs_pcb138_cadj_Log2",
  "hs_pcb153_cadj_Log2",
  "hs_pcb170_cadj_Log2",
  "hs_pcb180_cadj_Log2",
  "hs_dep_cadj_Log2",
  "hs_detp_cadj_Log2",
  "hs_dmdtp_cdich_None",
  "hs_dmp_cadj_Log2",
  "hs_dmtp_cadj_Log2",
  "hs_pbde153_cadj_Log2",
  "hs_pbde47_cadj_Log2",
  "hs_pfhxs_c_Log2",
  "hs_pfna_c_Log2",
  "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2",
  "hs_pfunda_c_Log2",
  "hs_bpa_cadj_Log2",
  "hs_bupa_cadj_Log2",
  "hs_etpa_cadj_Log2",
  "hs_mepa_cadj_Log2",
  "hs_oxbe_cadj_Log2",
  "hs_prpa_cadj_Log2",
  "hs_trcs_cadj_Log2",
  "hs_mbzp_cadj_Log2",
  "hs_mecpp_cadj_Log2",
  "hs_mehhp_cadj_Log2",
  "hs_mehp_cadj_Log2",
  "hs_meohp_cadj_Log2",
  "hs_mep_cadj_Log2",
  "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2",
  "hs_ohminp_cadj_Log2",
  "hs_oxominp_cadj_Log2",
  "hs_cotinine_cdich_None",
  "hs_globalexp2_None"
)

#postnatal diet for child
postnatal_diet <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_beverages_Ter",
  "hs_break_cer_Ter",
  "hs_caff_drink_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_proc_meat_Ter",
  "hs_readymade_Ter",
  "hs_total_bread_Ter",
  "hs_total_cereal_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_meat_Ter",
  "hs_total_potatoes_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter",
  "hs_total_yog_Ter"
)

all_chemicals <- exposome %>% dplyr::select(all_of(chemicals_full))

all_diet <- exposome %>% dplyr::select(all_of(postnatal_diet))

all_columns <- c(chemicals_full, postnatal_diet)
extracted_exposome <- exposome %>% dplyr::select(all_of(all_columns))

chemicals_outcome_cov <- cbind(outcome_cov, all_chemicals)

diet_outcome_cov <- cbind(outcome_cov, all_diet)

interested_data <- cbind(outcome_cov, extracted_exposome)
head(interested_data)
interested_data_corr <- select_if(interested_data, is.numeric)
# Pearson correlation on complete observations; only this matrix is used below
cor_matrix <- cor(interested_data_corr, method = "pearson", use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 90, tl.cex = 0.4)
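Beyond the visual corrplot, it can help to list the strongly correlated pairs explicitly, since collinear exposures affect how LASSO and elastic net distribute coefficients. A minimal sketch, assuming `cor_matrix` from above is in scope and using an arbitrary |r| > 0.8 cutoff:

```r
# Extract variable pairs with |r| > 0.8 from the upper triangle of cor_matrix
high_idx <- which(abs(cor_matrix) > 0.8 & upper.tri(cor_matrix), arr.ind = TRUE)
high_corr_pairs <- data.frame(
  var1 = rownames(cor_matrix)[high_idx[, 1]],
  var2 = colnames(cor_matrix)[high_idx[, 2]],
  r    = cor_matrix[high_idx]
)
# Strongest correlations first
high_corr_pairs[order(-abs(high_corr_pairs$r)), ]
```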

Comparing Models with and without Covariates

Baseline Covariates Model

covariates_columns <- colnames(covariate_data)
covariates_data <- interested_data %>% dplyr::select(all_of(covariates_columns))

y <- interested_data$hs_zbmi_who

set.seed(101)
trainIndex <- createDataPartition(y, p = .7, list = FALSE, times = 1)
train_data <- interested_data[trainIndex,]
test_data  <- interested_data[-trainIndex,]

x_train_cov <- model.matrix(~ . -1, data = covariates_data[trainIndex,])
x_test_cov <- model.matrix(~ . -1, data = covariates_data[-trainIndex,])

y_train <- train_data$hs_zbmi_who
y_test <- test_data$hs_zbmi_who

LASSO

fit_lasso_cov <- cv.glmnet(x_train_cov, y_train, alpha = 1, family = "gaussian")
plot(fit_lasso_cov, main = "LASSO Cross-Validation Curve (Covariates Only)")  # plot.cv.glmnet shows CV error vs log(lambda), not coefficient paths

best_lambda_lasso <- fit_lasso_cov$lambda.min
coef(fit_lasso_cov, s = best_lambda_lasso)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                           s1
## (Intercept)         0.413483
## e3_sex_Nonefemale   .       
## e3_sex_Nonemale     .       
## e3_yearbir_None2004 .       
## e3_yearbir_None2005 .       
## e3_yearbir_None2006 .       
## e3_yearbir_None2007 .       
## e3_yearbir_None2008 .       
## e3_yearbir_None2009 .       
## hs_child_age_None   .
lasso_predictions_cov <- predict(fit_lasso_cov, s = best_lambda_lasso, newx = x_test_cov)
mse_lasso_cov <- mean((y_test - lasso_predictions_cov)^2)
rmse_lasso_cov <- sqrt(mse_lasso_cov)
cat("Lasso Test MSE (Covariates only):", mse_lasso_cov, "\n")
## Lasso Test MSE (Covariates only): 1.319431
cat("Lasso Test RMSE (Covariates only):", rmse_lasso_cov, "\n")
## Lasso Test RMSE (Covariates only): 1.148665

Ridge

fit_ridge_cov <- cv.glmnet(x_train_cov, y_train, alpha = 0, family = "gaussian")
plot(fit_ridge_cov, main = "Ridge Cross-Validation Curve (Covariates Only)")

best_lambda_ridge <- fit_ridge_cov$lambda.min
coef(fit_ridge_cov, s = best_lambda_ridge)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                               s1
## (Intercept)          0.469753606
## e3_sex_Nonefemale   -0.004014094
## e3_sex_Nonemale      0.004014005
## e3_yearbir_None2004 -0.045803833
## e3_yearbir_None2005 -0.006042782
## e3_yearbir_None2006 -0.004660190
## e3_yearbir_None2007  0.001372564
## e3_yearbir_None2008  0.020817988
## e3_yearbir_None2009  0.072947788
## hs_child_age_None   -0.007171273
ridge_predictions_cov <- predict(fit_ridge_cov, s = best_lambda_ridge, newx = x_test_cov)
mse_ridge_cov <- mean((y_test - ridge_predictions_cov)^2)
rmse_ridge_cov <- sqrt(mse_ridge_cov)
cat("Ridge Test MSE (Covariates only):", mse_ridge_cov, "\n")
## Ridge Test MSE (Covariates only): 1.317463
cat("Ridge Test RMSE (Covariates only):", rmse_ridge_cov, "\n")
## Ridge Test RMSE (Covariates only): 1.147808

Elastic Net

fit_enet_cov <- cv.glmnet(x_train_cov, y_train, alpha = 0.5, family = "gaussian")
plot(fit_enet_cov, main = "Elastic Net Cross-Validation Curve (Covariates Only)")

best_lambda_enet <- fit_enet_cov$lambda.min
coef(fit_enet_cov, s = best_lambda_enet)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                              s1
## (Intercept)          0.54361182
## e3_sex_Nonefemale    .         
## e3_sex_Nonemale      .         
## e3_yearbir_None2004 -0.15055331
## e3_yearbir_None2005  .         
## e3_yearbir_None2006  .         
## e3_yearbir_None2007  .         
## e3_yearbir_None2008  0.03548813
## e3_yearbir_None2009  0.11072642
## hs_child_age_None   -0.01602837
enet_predictions_cov <- predict(fit_enet_cov, s = best_lambda_enet, newx = x_test_cov)
mse_enet_cov <- mean((y_test - enet_predictions_cov)^2)
rmse_enet_cov <- sqrt(mse_enet_cov)
cat("Elastic Net Test MSE (Covariates only):", mse_enet_cov, "\n")
## Elastic Net Test MSE (Covariates only): 1.318349
cat("Elastic Net Test RMSE (Covariates only):", rmse_enet_cov, "\n")
## Elastic Net Test RMSE (Covariates only): 1.148194
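The three baseline fits can be summarized side by side. A minimal sketch, assuming the `mse_*_cov` and `rmse_*_cov` objects from the chunks above are in scope:

```r
# Collect covariates-only test errors for comparison
baseline_results <- data.frame(
  model     = c("LASSO", "Ridge", "Elastic Net"),
  test_mse  = c(mse_lasso_cov, mse_ridge_cov, mse_enet_cov),
  test_rmse = c(rmse_lasso_cov, rmse_ridge_cov, rmse_enet_cov)
)
baseline_results[order(baseline_results$test_mse), ]
```

All three models land near a test MSE of 1.32, which sets the reference point the exposure models below need to beat.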

Chemicals Data

The chemical exposure variables will be screened for the most informative predictors using the same penalized regression methods.

LASSO

#LASSO train/test 70-30
set.seed(101)
train_indices <- sample(seq_len(nrow(chemicals_outcome_cov)), size = floor(0.7 * nrow(chemicals_outcome_cov)))
test_indices <- setdiff(seq_len(nrow(chemicals_outcome_cov)), train_indices)

x_train <- as.matrix(chemicals_outcome_cov[train_indices, setdiff(names(chemicals_outcome_cov), "hs_zbmi_who")])
y_train <- chemicals_outcome_cov$hs_zbmi_who[train_indices]

x_test <- as.matrix(chemicals_outcome_cov[test_indices, setdiff(names(chemicals_outcome_cov), "hs_zbmi_who")])
y_test <- chemicals_outcome_cov$hs_zbmi_who[test_indices]

x_train_chemicals_only <- as.matrix(chemicals_outcome_cov[train_indices, chemicals_full])
x_test_chemicals_only <- as.matrix(chemicals_outcome_cov[test_indices, chemicals_full])

fit_without_covariates_train <- cv.glmnet(x_train_chemicals_only, y_train, alpha = 1, family = "gaussian")
fit_without_covariates_test <- predict(fit_without_covariates_train, s = "lambda.min", newx = x_test_chemicals_only)
test_mse_without_covariates <- mean((y_test - fit_without_covariates_test)^2)

plot(fit_without_covariates_train, main = "LASSO Cross-Validation Curve (Chemicals, Without Covariates)")

best_lambda <- fit_without_covariates_train$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates_train, s = best_lambda)
## 50 x 1 sparse Matrix of class "dgCMatrix"
##                                   s1
## (Intercept)            -4.7797230131
## hs_as_c_Log2            .           
## hs_cd_c_Log2           -0.0238815730
## hs_co_c_Log2           -0.0011670319
## hs_cs_c_Log2            0.0771865955
## hs_cu_c_Log2            0.6071183261
## hs_hg_c_Log2           -0.0075730086
## hs_mn_c_Log2            .           
## hs_mo_c_Log2           -0.0992489424
## hs_pb_c_Log2           -0.0056257448
## hs_tl_cdich_None        .           
## hs_dde_cadj_Log2       -0.0378984008
## hs_ddt_cadj_Log2        .           
## hs_hcb_cadj_Log2        .           
## hs_pcb118_cadj_Log2     .           
## hs_pcb138_cadj_Log2     .           
## hs_pcb153_cadj_Log2    -0.1721262187
## hs_pcb170_cadj_Log2    -0.0557570999
## hs_pcb180_cadj_Log2     .           
## hs_dep_cadj_Log2       -0.0186165147
## hs_detp_cadj_Log2       .           
## hs_dmdtp_cdich_None     .           
## hs_dmp_cadj_Log2        .           
## hs_dmtp_cadj_Log2       .           
## hs_pbde153_cadj_Log2   -0.0357794002
## hs_pbde47_cadj_Log2     .           
## hs_pfhxs_c_Log2        -0.0019079468
## hs_pfna_c_Log2          .           
## hs_pfoa_c_Log2         -0.1360824261
## hs_pfos_c_Log2         -0.0478302901
## hs_pfunda_c_Log2        .           
## hs_bpa_cadj_Log2        .           
## hs_bupa_cadj_Log2       .           
## hs_etpa_cadj_Log2       .           
## hs_mepa_cadj_Log2       .           
## hs_oxbe_cadj_Log2       0.0008622765
## hs_prpa_cadj_Log2       0.0011728557
## hs_trcs_cadj_Log2       .           
## hs_mbzp_cadj_Log2       0.0373221816
## hs_mecpp_cadj_Log2      .           
## hs_mehhp_cadj_Log2      .           
## hs_mehp_cadj_Log2       .           
## hs_meohp_cadj_Log2      .           
## hs_mep_cadj_Log2        .           
## hs_mibp_cadj_Log2      -0.0477304169
## hs_mnbp_cadj_Log2      -0.0036235331
## hs_ohminp_cadj_Log2     .           
## hs_oxominp_cadj_Log2    .           
## hs_cotinine_cdich_None  .           
## hs_globalexp2_None      .
cat("Model without Covariates - Test MSE:", test_mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.231997

Ridge

# RIDGE
fit_without_covariates_train <- cv.glmnet(x_train_chemicals_only, y_train, alpha = 0, family = "gaussian")
fit_without_covariates_test <- predict(fit_without_covariates_train, s = "lambda.min", newx = x_test_chemicals_only)
test_mse_without_covariates <- mean((y_test - fit_without_covariates_test)^2)

plot(fit_without_covariates_train, main = "Ridge Cross-Validation Curve (Chemicals, Without Covariates)")

best_lambda <- fit_without_covariates_train$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates_train, s = best_lambda)
## 50 x 1 sparse Matrix of class "dgCMatrix"
##                                   s1
## (Intercept)            -4.469806e+00
## hs_as_c_Log2            6.590433e-03
## hs_cd_c_Log2           -4.093355e-02
## hs_co_c_Log2           -5.049922e-02
## hs_cs_c_Log2            1.230373e-01
## hs_cu_c_Log2            6.078479e-01
## hs_hg_c_Log2           -3.225520e-02
## hs_mn_c_Log2           -3.089195e-02
## hs_mo_c_Log2           -1.068154e-01
## hs_pb_c_Log2           -5.295956e-02
## hs_tl_cdich_None        .           
## hs_dde_cadj_Log2       -4.888006e-02
## hs_ddt_cadj_Log2        4.045085e-03
## hs_hcb_cadj_Log2       -1.857150e-02
## hs_pcb118_cadj_Log2     1.400112e-02
## hs_pcb138_cadj_Log2    -3.614513e-02
## hs_pcb153_cadj_Log2    -1.223407e-01
## hs_pcb170_cadj_Log2    -5.267521e-02
## hs_pcb180_cadj_Log2    -1.074695e-02
## hs_dep_cadj_Log2       -2.548881e-02
## hs_detp_cadj_Log2       8.051621e-03
## hs_dmdtp_cdich_None     .           
## hs_dmp_cadj_Log2       -2.097690e-03
## hs_dmtp_cadj_Log2       7.300567e-05
## hs_pbde153_cadj_Log2   -3.315313e-02
## hs_pbde47_cadj_Log2     5.273953e-03
## hs_pfhxs_c_Log2        -2.966308e-02
## hs_pfna_c_Log2          2.336166e-02
## hs_pfoa_c_Log2         -1.519872e-01
## hs_pfos_c_Log2         -6.495855e-02
## hs_pfunda_c_Log2        1.248503e-02
## hs_bpa_cadj_Log2        3.832688e-04
## hs_bupa_cadj_Log2       6.588467e-03
## hs_etpa_cadj_Log2      -6.098679e-03
## hs_mepa_cadj_Log2      -1.638466e-02
## hs_oxbe_cadj_Log2       1.390524e-02
## hs_prpa_cadj_Log2       1.258510e-02
## hs_trcs_cadj_Log2       2.878805e-03
## hs_mbzp_cadj_Log2       5.550048e-02
## hs_mecpp_cadj_Log2      1.627174e-03
## hs_mehhp_cadj_Log2      2.316991e-02
## hs_mehp_cadj_Log2      -1.662304e-02
## hs_meohp_cadj_Log2      1.137436e-02
## hs_mep_cadj_Log2        3.371106e-03
## hs_mibp_cadj_Log2      -5.391219e-02
## hs_mnbp_cadj_Log2      -4.383016e-02
## hs_ohminp_cadj_Log2    -2.886768e-02
## hs_oxominp_cadj_Log2    2.204660e-02
## hs_cotinine_cdich_None  .           
## hs_globalexp2_None      .
cat("Model without Covariates - Test MSE:", test_mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.188752

Elastic Net

# ELASTIC NET
fit_without_covariates_train <- cv.glmnet(x_train_chemicals_only, y_train, alpha = 0.5, family = "gaussian")
fit_without_covariates_test <- predict(fit_without_covariates_train, s = "lambda.min", newx = x_test_chemicals_only)
test_mse_without_covariates <- mean((y_test - fit_without_covariates_test)^2)

plot(fit_without_covariates_train, main = "Elastic Net Cross-Validation Curve (Chemicals, Without Covariates)")

best_lambda <- fit_without_covariates_train$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates_train, s = best_lambda)
## 50 x 1 sparse Matrix of class "dgCMatrix"
##                                  s1
## (Intercept)            -4.785950188
## hs_as_c_Log2            .          
## hs_cd_c_Log2           -0.025843356
## hs_co_c_Log2           -0.005835867
## hs_cs_c_Log2            0.084715330
## hs_cu_c_Log2            0.607379616
## hs_hg_c_Log2           -0.009800093
## hs_mn_c_Log2            .          
## hs_mo_c_Log2           -0.099724922
## hs_pb_c_Log2           -0.010318890
## hs_tl_cdich_None        .          
## hs_dde_cadj_Log2       -0.039528137
## hs_ddt_cadj_Log2        .          
## hs_hcb_cadj_Log2        .          
## hs_pcb118_cadj_Log2     .          
## hs_pcb138_cadj_Log2     .          
## hs_pcb153_cadj_Log2    -0.169008355
## hs_pcb170_cadj_Log2    -0.055808065
## hs_pcb180_cadj_Log2     .          
## hs_dep_cadj_Log2       -0.019034348
## hs_detp_cadj_Log2       .          
## hs_dmdtp_cdich_None     .          
## hs_dmp_cadj_Log2        .          
## hs_dmtp_cadj_Log2       .          
## hs_pbde153_cadj_Log2   -0.035464586
## hs_pbde47_cadj_Log2     .          
## hs_pfhxs_c_Log2        -0.006816020
## hs_pfna_c_Log2          .          
## hs_pfoa_c_Log2         -0.135997766
## hs_pfos_c_Log2         -0.047692264
## hs_pfunda_c_Log2        .          
## hs_bpa_cadj_Log2        .          
## hs_bupa_cadj_Log2       .          
## hs_etpa_cadj_Log2       .          
## hs_mepa_cadj_Log2       .          
## hs_oxbe_cadj_Log2       0.002529961
## hs_prpa_cadj_Log2       0.001735800
## hs_trcs_cadj_Log2       .          
## hs_mbzp_cadj_Log2       0.040317847
## hs_mecpp_cadj_Log2      .          
## hs_mehhp_cadj_Log2      .          
## hs_mehp_cadj_Log2       .          
## hs_meohp_cadj_Log2      .          
## hs_mep_cadj_Log2        .          
## hs_mibp_cadj_Log2      -0.047892677
## hs_mnbp_cadj_Log2      -0.008483913
## hs_ohminp_cadj_Log2     .          
## hs_oxominp_cadj_Log2    .          
## hs_cotinine_cdich_None  .          
## hs_globalexp2_None      .
cat("Model without Covariates - Test MSE:", test_mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.228805
# chemicals selected based on the non-zero coefficients from the penalized fits above
chemicals_selected <- c(
  "hs_cd_c_Log2",
  "hs_co_c_Log2",
  "hs_cs_c_Log2",
  "hs_cu_c_Log2",
  "hs_hg_c_Log2",
  "hs_mo_c_Log2",
  "hs_pb_c_Log2",
  "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2",
  "hs_pcb170_cadj_Log2",
  "hs_dep_cadj_Log2",
  "hs_detp_cadj_Log2",
  "hs_pbde153_cadj_Log2",
  "hs_pfhxs_c_Log2",
  "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2",
  "hs_mepa_cadj_Log2",
  "hs_oxbe_cadj_Log2",
  "hs_prpa_cadj_Log2",
  "hs_mbzp_cadj_Log2",
  "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2")

The chemical features were selected from the non-zero coefficients of the elastic net model. LASSO tends to keep only one of a group of correlated predictors, so elastic net was preferred here, since the relative importance of correlated exposures is uncertain.
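As a follow-up, the selected subset can be refit to check that the reduced model keeps comparable test error. A minimal sketch, assuming `chemicals_outcome_cov`, `chemicals_selected`, `train_indices`/`test_indices`, and `y_train`/`y_test` from the chunks above are in scope:

```r
# Refit elastic net using only the selected chemicals
x_sel_train <- as.matrix(chemicals_outcome_cov[train_indices, chemicals_selected])
x_sel_test  <- as.matrix(chemicals_outcome_cov[test_indices, chemicals_selected])

fit_selected <- cv.glmnet(x_sel_train, y_train, alpha = 0.5, family = "gaussian")
pred_selected <- predict(fit_selected, s = "lambda.min", newx = x_sel_test)
cat("Selected-chemicals Test MSE:", mean((y_test - pred_selected)^2), "\n")
```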

Postnatal Diet Data

As with the chemical variables, the children's postnatal diet variables will be screened with the same regularized regression methods to identify the most informative predictors.

LASSO

# LASSO with train/test
set.seed(101)  
train_indices <- sample(seq_len(nrow(diet_outcome_cov)), size = floor(0.7 * nrow(diet_outcome_cov)))
test_indices <- setdiff(seq_len(nrow(diet_outcome_cov)), train_indices)

diet_data <- diet_outcome_cov[, postnatal_diet]
x_diet_train <- model.matrix(~ . + 0, data = diet_data[train_indices, ])  
x_diet_test <- model.matrix(~ . + 0, data = diet_data[test_indices, ])  

covariates <- diet_outcome_cov[, c("e3_sex_None", "e3_yearbir_None", "hs_child_age_None")]
x_covariates_train <- model.matrix(~ . + 0, data = covariates[train_indices, ]) 
x_covariates_test <- model.matrix(~ . + 0, data = covariates[test_indices, ])

x_full_train <- cbind(x_diet_train, x_covariates_train)
x_full_test <- cbind(x_diet_test, x_covariates_test)

x_full_train[is.na(x_full_train)] <- 0
x_full_test[is.na(x_full_test)] <- 0
x_diet_train[is.na(x_diet_train)] <- 0
x_diet_test[is.na(x_diet_test)] <- 0

y_train <- as.numeric(diet_outcome_cov$hs_zbmi_who[train_indices])
y_test <- as.numeric(diet_outcome_cov$hs_zbmi_who[test_indices])

# fit models
fit_without_covariates <- cv.glmnet(x_diet_train, y_train, alpha = 1, family = "gaussian")
fit_without_covariates
## 
## Call:  cv.glmnet(x = x_diet_train, y = y_train, alpha = 1, family = "gaussian") 
## 
## Measure: Mean-Squared Error 
## 
##      Lambda Index Measure      SE Nonzero
## min 0.06922     9   1.431 0.06022       5
## 1se 0.14570     1   1.442 0.06160       0
plot(fit_without_covariates$glmnet.fit, xvar = "lambda", main = "Coefficient Path (Without Covariates)")  # coefficient path is on the glmnet.fit component; plot() on the cv object shows the CV curve

best_lambda <- fit_without_covariates$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 41 x 1 sparse Matrix of class "dgCMatrix"
##                                         s1
## (Intercept)                     0.53256344
## h_bfdur_Ter(0,10.8]             .         
## h_bfdur_Ter(10.8,34.9]          .         
## h_bfdur_Ter(34.9,Inf]           .         
## hs_bakery_prod_Ter(2,6]         .         
## hs_bakery_prod_Ter(6,Inf]       .         
## hs_beverages_Ter(0.132,1]       .         
## hs_beverages_Ter(1,Inf]         .         
## hs_break_cer_Ter(1.1,5.5]       .         
## hs_break_cer_Ter(5.5,Inf]       .         
## hs_caff_drink_Ter(0.132,Inf]    .         
## hs_dairy_Ter(14.6,25.6]         .         
## hs_dairy_Ter(25.6,Inf]          .         
## hs_fastfood_Ter(0.132,0.5]      .         
## hs_fastfood_Ter(0.5,Inf]        .         
## hs_org_food_Ter(0.132,1]        .         
## hs_org_food_Ter(1,Inf]         -0.13588632
## hs_proc_meat_Ter(1.5,4]         .         
## hs_proc_meat_Ter(4,Inf]         .         
## hs_readymade_Ter(0.132,0.5]     .         
## hs_readymade_Ter(0.5,Inf]       .         
## hs_total_bread_Ter(7,17.5]      .         
## hs_total_bread_Ter(17.5,Inf]    .         
## hs_total_cereal_Ter(14.1,23.6]  .         
## hs_total_cereal_Ter(23.6,Inf]   .         
## hs_total_fish_Ter(1.5,3]        .         
## hs_total_fish_Ter(3,Inf]        .         
## hs_total_fruits_Ter(7,14.1]     .         
## hs_total_fruits_Ter(14.1,Inf]  -0.02481964
## hs_total_lipids_Ter(3,7]        .         
## hs_total_lipids_Ter(7,Inf]     -0.05164312
## hs_total_meat_Ter(6,9]          .         
## hs_total_meat_Ter(9,Inf]        .         
## hs_total_potatoes_Ter(3,4]      .         
## hs_total_potatoes_Ter(4,Inf]    .         
## hs_total_sweets_Ter(4.1,8.5]   -0.01594403
## hs_total_sweets_Ter(8.5,Inf]    .         
## hs_total_veg_Ter(6,8.5]         .         
## hs_total_veg_Ter(8.5,Inf]      -0.08180563
## hs_total_yog_Ter(6,8.5]         .         
## hs_total_yog_Ter(8.5,Inf]       .
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_diet_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)

cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.34942

Ridge

# RIDGE
fit_without_covariates <- cv.glmnet(x_diet_train, y_train, alpha = 0, family = "gaussian")
fit_without_covariates
## 
## Call:  cv.glmnet(x = x_diet_train, y = y_train, alpha = 0, family = "gaussian") 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure      SE Nonzero
## min   3.53    41   1.431 0.08497      40
## 1se 145.70     1   1.441 0.08233      40
plot(fit_without_covariates$glmnet.fit, xvar = "lambda", main = "Coefficient Path (Without Covariates)")  # coefficient path is on the glmnet.fit component

best_lambda <- fit_without_covariates$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 41 x 1 sparse Matrix of class "dgCMatrix"
##                                           s1
## (Intercept)                     0.5163069457
## h_bfdur_Ter(0,10.8]            -0.0114164662
## h_bfdur_Ter(10.8,34.9]          0.0353770607
## h_bfdur_Ter(34.9,Inf]          -0.0138651651
## hs_bakery_prod_Ter(2,6]         0.0228606785
## hs_bakery_prod_Ter(6,Inf]      -0.0268639952
## hs_beverages_Ter(0.132,1]      -0.0065939314
## hs_beverages_Ter(1,Inf]        -0.0016124215
## hs_break_cer_Ter(1.1,5.5]      -0.0034207548
## hs_break_cer_Ter(5.5,Inf]      -0.0337182186
## hs_caff_drink_Ter(0.132,Inf]   -0.0143879393
## hs_dairy_Ter(14.6,25.6]         0.0355023507
## hs_dairy_Ter(25.6,Inf]         -0.0005581647
## hs_fastfood_Ter(0.132,0.5]      0.0161761119
## hs_fastfood_Ter(0.5,Inf]       -0.0001750742
## hs_org_food_Ter(0.132,1]        0.0151677373
## hs_org_food_Ter(1,Inf]         -0.0682466785
## hs_proc_meat_Ter(1.5,4]         0.0222199344
## hs_proc_meat_Ter(4,Inf]        -0.0187135643
## hs_readymade_Ter(0.132,0.5]    -0.0013536008
## hs_readymade_Ter(0.5,Inf]       0.0105115509
## hs_total_bread_Ter(7,17.5]     -0.0035702530
## hs_total_bread_Ter(17.5,Inf]   -0.0070550360
## hs_total_cereal_Ter(14.1,23.6]  0.0082269928
## hs_total_cereal_Ter(23.6,Inf]  -0.0131001584
## hs_total_fish_Ter(1.5,3]       -0.0346609367
## hs_total_fish_Ter(3,Inf]       -0.0051749487
## hs_total_fruits_Ter(7,14.1]     0.0266413533
## hs_total_fruits_Ter(14.1,Inf]  -0.0389551124
## hs_total_lipids_Ter(3,7]       -0.0022752284
## hs_total_lipids_Ter(7,Inf]     -0.0476627593
## hs_total_meat_Ter(6,9]          0.0007524275
## hs_total_meat_Ter(9,Inf]        0.0005196923
## hs_total_potatoes_Ter(3,4]      0.0105526823
## hs_total_potatoes_Ter(4,Inf]    0.0048180175
## hs_total_sweets_Ter(4.1,8.5]   -0.0392140671
## hs_total_sweets_Ter(8.5,Inf]   -0.0010028529
## hs_total_veg_Ter(6,8.5]         0.0009962184
## hs_total_veg_Ter(8.5,Inf]      -0.0556956882
## hs_total_yog_Ter(6,8.5]        -0.0102351610
## hs_total_yog_Ter(8.5,Inf]      -0.0089303177
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_diet_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)

cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.326308

Elastic Net

#ELASTIC NET
fit_without_covariates <- cv.glmnet(x_diet_train, y_train, alpha = 0.5, family = "gaussian")
fit_without_covariates
## 
## Call:  cv.glmnet(x = x_diet_train, y = y_train, alpha = 0.5, family = "gaussian") 
## 
## Measure: Mean-Squared Error 
## 
##      Lambda Index Measure      SE Nonzero
## min 0.07218    16   1.430 0.05641      12
## 1se 0.29139     1   1.444 0.05877       0
plot(fit_without_covariates$glmnet.fit, xvar = "lambda", main = "Coefficient Path (Without Covariates)")  # coefficient path is on the glmnet.fit component

best_lambda <- fit_without_covariates$lambda.min  # lambda that minimizes the MSE
coef(fit_without_covariates, s = best_lambda)
## 41 x 1 sparse Matrix of class "dgCMatrix"
##                                          s1
## (Intercept)                     0.650606526
## h_bfdur_Ter(0,10.8]             .          
## h_bfdur_Ter(10.8,34.9]          0.039832328
## h_bfdur_Ter(34.9,Inf]           .          
## hs_bakery_prod_Ter(2,6]         .          
## hs_bakery_prod_Ter(6,Inf]      -0.052635590
## hs_beverages_Ter(0.132,1]       .          
## hs_beverages_Ter(1,Inf]         .          
## hs_break_cer_Ter(1.1,5.5]       .          
## hs_break_cer_Ter(5.5,Inf]      -0.054788470
## hs_caff_drink_Ter(0.132,Inf]    .          
## hs_dairy_Ter(14.6,25.6]         0.053455833
## hs_dairy_Ter(25.6,Inf]          .          
## hs_fastfood_Ter(0.132,0.5]      .          
## hs_fastfood_Ter(0.5,Inf]        .          
## hs_org_food_Ter(0.132,1]        .          
## hs_org_food_Ter(1,Inf]         -0.185235916
## hs_proc_meat_Ter(1.5,4]         0.008558872
## hs_proc_meat_Ter(4,Inf]         .          
## hs_readymade_Ter(0.132,0.5]     .          
## hs_readymade_Ter(0.5,Inf]       .          
## hs_total_bread_Ter(7,17.5]      .          
## hs_total_bread_Ter(17.5,Inf]    .          
## hs_total_cereal_Ter(14.1,23.6]  .          
## hs_total_cereal_Ter(23.6,Inf]   .          
## hs_total_fish_Ter(1.5,3]       -0.057540803
## hs_total_fish_Ter(3,Inf]        .          
## hs_total_fruits_Ter(7,14.1]     0.017171763
## hs_total_fruits_Ter(14.1,Inf]  -0.054914989
## hs_total_lipids_Ter(3,7]        .          
## hs_total_lipids_Ter(7,Inf]     -0.094342286
## hs_total_meat_Ter(6,9]          .          
## hs_total_meat_Ter(9,Inf]        .          
## hs_total_potatoes_Ter(3,4]      .          
## hs_total_potatoes_Ter(4,Inf]    .          
## hs_total_sweets_Ter(4.1,8.5]   -0.089860153
## hs_total_sweets_Ter(8.5,Inf]    .          
## hs_total_veg_Ter(6,8.5]         .          
## hs_total_veg_Ter(8.5,Inf]      -0.118161721
## hs_total_yog_Ter(6,8.5]         .          
## hs_total_yog_Ter(8.5,Inf]       .
predictions_without_covariates <- predict(fit_without_covariates, s = "lambda.min", newx = x_diet_test)
mse_without_covariates <- mean((y_test - predictions_without_covariates)^2)

cat("Model without Covariates - Test MSE:", mse_without_covariates, "\n")
## Model without Covariates - Test MSE: 1.335144
#selected diets that were noted in enet
diet_selected <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_break_cer_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_proc_meat_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter"
)

Elastic net was again used for variable selection because it was unclear in advance which features would be important; LASSO's harder selection could have discarded relevant correlated predictors.
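Note that glmnet selects individual tercile dummies (e.g. `hs_org_food_Ter(1,Inf]`), while `diet_selected` lists the base variables. A small sketch of collapsing dummy names back to base names, assuming the level suffix is the `(a,b]` text that `model.matrix` appends:

```r
# Strip the "(a,b]" tercile-level suffix to recover base variable names
selected_levels <- c("hs_org_food_Ter(1,Inf]", "hs_total_veg_Ter(8.5,Inf]",
                     "hs_dairy_Ter(14.6,25.6]")
unique(sub("\\(.*$", "", selected_levels))
```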

Finalized Data

With the selected variables in hand, the analysis builds up from the covariates alone, then adds the other variable groups. The covariates were frozen (given a penalty factor of zero) in the LASSO, ridge, and elastic net fits so that they are not shrunk.
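A minimal illustration (simulated data, not the study data) of how `penalty.factor = 0` exempts a covariate from shrinkage in `glmnet`:

```r
library(glmnet)
set.seed(101)
x <- matrix(rnorm(100 * 5), 100, 5)
y <- 2 * x[, 1] + rnorm(100)
# a penalty factor of 0 on column 1 means it is never shrunk;
# the remaining columns carry the usual LASSO penalty
fit <- glmnet(x, y, alpha = 1, penalty.factor = c(0, rep(1, 4)))
# even at the largest lambda, the frozen column keeps a nonzero coefficient
coef(fit, s = max(fit$lambda))
```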

#selected chemicals that were noted in enet
chemicals_selected <- c(
  "hs_cd_c_Log2",
  "hs_co_c_Log2",
  "hs_cs_c_Log2",
  "hs_cu_c_Log2",
  "hs_hg_c_Log2",
  "hs_mo_c_Log2",
  "hs_pb_c_Log2",
  "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2",
  "hs_pcb170_cadj_Log2",
  "hs_dep_cadj_Log2",
  "hs_detp_cadj_Log2",
  "hs_pbde153_cadj_Log2",
  "hs_pfhxs_c_Log2",
  "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2",
  "hs_mepa_cadj_Log2",
  "hs_oxbe_cadj_Log2",
  "hs_prpa_cadj_Log2",
  "hs_mbzp_cadj_Log2",
  "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2")
#selected diets that were noted in enet
diet_selected <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_break_cer_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_proc_meat_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter"
)
combined_data_selected <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_break_cer_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_proc_meat_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter",
  "hs_cd_c_Log2",
  "hs_co_c_Log2",
  "hs_cs_c_Log2",
  "hs_cu_c_Log2",
  "hs_hg_c_Log2",
  "hs_mo_c_Log2",
  "hs_pb_c_Log2",
  "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2",
  "hs_pcb170_cadj_Log2",
  "hs_dep_cadj_Log2",
  "hs_pbde153_cadj_Log2",
  "hs_pfhxs_c_Log2",
  "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2",
  "hs_prpa_cadj_Log2",
  "hs_mbzp_cadj_Log2",
  "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2"
)

outcome_cov <- cbind(covariate_data, outcome_BMI)
outcome_cov <- outcome_cov[, !duplicated(colnames(outcome_cov))]

finalized_columns <- c(combined_data_selected)
final_selected_data <- exposome %>% dplyr::select(all_of(finalized_columns))

finalized_data <- cbind(outcome_cov, final_selected_data)
head(finalized_data)
numeric_finalized <- finalized_data %>%
  dplyr::select(where(is.numeric))

cor_matrix <- cor(numeric_finalized, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 90, tl.cex = 0.6)

find_highly_correlated <- function(cor_matrix, threshold = 0.8) {
  cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- NA  
  cor_matrix <- as.data.frame(as.table(cor_matrix)) 
  cor_matrix <- na.omit(cor_matrix)  
  cor_matrix <- cor_matrix[order(-abs(cor_matrix$Freq)), ]  
  cor_matrix <- cor_matrix %>% filter(abs(Freq) > threshold)  
  return(cor_matrix)
}

highly_correlated_pairs <- find_highly_correlated(cor_matrix, threshold = 0.50)
highly_correlated_pairs
set.seed(101)

# Splitting data into training and test sets
train_indices <- sample(seq_len(nrow(finalized_data)), size = floor(0.7 * nrow(finalized_data)))
test_indices <- setdiff(seq_len(nrow(finalized_data)), train_indices)

# Creating training and test datasets
train_data <- finalized_data[train_indices, ]
test_data <- finalized_data[test_indices, ]

# Separating predictors and outcome variable
x_train <- model.matrix(~ . + 0, data = train_data[ , !names(train_data) %in% "hs_zbmi_who"])
x_test <- model.matrix(~ . + 0, data = test_data[ , !names(test_data) %in% "hs_zbmi_who"])
y_train <- train_data$hs_zbmi_who
y_test <- test_data$hs_zbmi_who

covariates_selected <- c("hs_child_age_None", "h_cohort", "e3_sex_None", "e3_yearbir_None")

#to freeze the covariates and make sure they are not shrunk
penalty_factors <- rep(1, ncol(x_train))
# note: model.matrix expands factors into dummy columns (e.g. "e3_sex_Nonefemale"),
# which do not exactly match the base names in covariates_selected; only
# "hs_child_age_None" matches here, so the factor covariates stay penalized
# unless prefix matching (e.g. startsWith) is used instead of %in%
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0

Baseline (Covariates)

covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")
baseline_data <- finalized_data %>% dplyr::select(c(covariates_selected, "hs_zbmi_who"))

x <- model.matrix(~ . -1, data = baseline_data[ , !names(baseline_data) %in% "hs_zbmi_who"])
y <- baseline_data$hs_zbmi_who

set.seed(101)
trainIndex <- createDataPartition(y, p = .7, list = FALSE, times = 1)
x_train <- x[trainIndex, ]
y_train <- y[trainIndex]
x_test <- x[-trainIndex, ]
y_test <- y[-trainIndex]

penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0

LASSO

set.seed(101)
fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)

coef(fit_lasso)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                              s1
## (Intercept)          0.87326347
## hs_child_age_None   -0.05738585
## e3_sex_Nonefemale    .         
## e3_sex_Nonemale      .         
## e3_yearbir_None2004  .         
## e3_yearbir_None2005  .         
## e3_yearbir_None2006  .         
## e3_yearbir_None2007  .         
## e3_yearbir_None2008  .         
## e3_yearbir_None2009  .
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)

cat("Baseline Lasso MSE:", mse_lasso, "\n")
## Baseline Lasso MSE: 1.327143
cat("Baseline Lasso RMSE:", rmse_lasso, "\n")
## Baseline Lasso RMSE: 1.152017

Ridge

set.seed(101)
fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)

coef(fit_ridge)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                                s1
## (Intercept)          8.732635e-01
## hs_child_age_None   -5.738585e-02
## e3_sex_Nonefemale   -3.110211e-38
## e3_sex_Nonemale      3.110211e-38
## e3_yearbir_None2004 -1.463782e-37
## e3_yearbir_None2005  2.704907e-38
## e3_yearbir_None2006 -9.332751e-39
## e3_yearbir_None2007 -5.445405e-38
## e3_yearbir_None2008  3.129795e-38
## e3_yearbir_None2009  3.620840e-37
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)

cat("Baseline Ridge MSE:", mse_ridge, "\n")
## Baseline Ridge MSE: 1.327143
cat("Baseline Ridge RMSE:", rmse_ridge, "\n")
## Baseline Ridge RMSE: 1.152017

Elastic Net

set.seed(101)
fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)

coef(fit_enet)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                              s1
## (Intercept)          0.87326347
## hs_child_age_None   -0.05738585
## e3_sex_Nonefemale    .         
## e3_sex_Nonemale      .         
## e3_yearbir_None2004  .         
## e3_yearbir_None2005  .         
## e3_yearbir_None2006  .         
## e3_yearbir_None2007  .         
## e3_yearbir_None2008  .         
## e3_yearbir_None2009  .
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)

cat("Baseline Elastic Net MSE:", mse_enet, "\n")
## Baseline Elastic Net MSE: 1.327143
cat("Baseline Elastic Net RMSE:", rmse_enet, "\n")
## Baseline Elastic Net RMSE: 1.152017

Decision Tree

set.seed(101)
trainIndex <- createDataPartition(baseline_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- baseline_data[trainIndex, ]
test_data <- baseline_data[-trainIndex, ]

fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")

rpart.plot(fit_tree)

printcp(fit_tree)
## 
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
## 
## Variables actually used in tree construction:
## [1] hs_child_age_None
## 
## Root node error: 1328.8/913 = 1.4554
## 
## n= 913 
## 
##         CP nsplit rel error xerror     xstd
## 1 0.011581      0   1.00000 1.0026 0.051032
## 2 0.010000      1   0.98842 1.0130 0.051854
plotcp(fit_tree)

optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]

pruned_tree <- prune(fit_tree, cp = optimal_cp)

rpart.plot(pruned_tree)

pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)

cat("Baseline Pruned Decision Tree MSE:", mse_pruned_tree, "\n")
## Baseline Pruned Decision Tree MSE: 1.319431
cat("Baseline Pruned Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Baseline Pruned Decision Tree RMSE: 1.148665

The pruned decision tree model has an RMSE of 1.148665 on the test data, meaning its predictions are off by about 1.15 BMI Z-score units on average. Since the tree used only one variable (hs_child_age_None), child age appears to be the dominant predictor of hs_zbmi_who in this dataset. The performance metrics nonetheless leave room for improvement, potentially through more complex models or additional relevant features.
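For context, a sketch (assuming `train_data` and `test_data` from the split above) comparing the tree against an intercept-only benchmark that always predicts the training-set mean:

```r
# Mean-only benchmark: if the tree's RMSE is barely below this,
# the single age split is adding little predictive signal
mean_pred <- mean(train_data$hs_zbmi_who)
rmse_mean <- sqrt(mean((test_data$hs_zbmi_who - mean_pred)^2))
cat("Mean-only benchmark RMSE:", rmse_mean, "\n")
```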

Random Forest

fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500, importance = TRUE)
varImpPlot(fit_rf)

rf_predictions <- predict(fit_rf, newdata = test_data)
mse_rf <- mean((test_data$hs_zbmi_who - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)

cat("Baseline Random Forest MSE:", mse_rf, "\n")
## Baseline Random Forest MSE: 1.342326
cat("Baseline Random Forest RMSE:", rmse_rf, "\n")
## Baseline Random Forest RMSE: 1.158588

Importance plots show that age and year of birth are the most influential variables.
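The importance ranking can also be inspected numerically (a sketch assuming the `fit_rf` object above, which was fit with `importance = TRUE`):

```r
# Sort variables by the first importance column (%IncMSE when importance = TRUE)
imp <- importance(fit_rf)
imp[order(-imp[, 1]), , drop = FALSE]
```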

GBM

fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 1000, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5, verbose = FALSE)
summary(fit_gbm)
best_trees <- gbm.perf(fit_gbm, method = "cv")

gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_trees)
mse_gbm <- mean((test_data$hs_zbmi_who - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)

cat("Baseline GBM MSE:", mse_gbm, "\n")
## Baseline GBM MSE: 1.320142
cat("Baseline GBM RMSE:", rmse_gbm, "\n")
## Baseline GBM RMSE: 1.148974

Model Comparison:

At baseline, Lasso, Ridge, and Elastic Net produce identical test MSE (1.327) and RMSE (1.152): in each fit the penalized coefficients are shrunk to zero (or negligibly close to it), leaving the unpenalized child-age term as the only effective predictor.

The Decision Tree model has a slightly lower MSE (1.319) and RMSE (1.149) compared to Lasso, Ridge, and Elastic Net, indicating marginally better performance.

The Random Forest model has the highest MSE (1.342) and RMSE (1.159) among the models, suggesting that it may not be the best choice for this particular dataset.

The GBM model has similar performance to the Decision Tree, with an MSE of 1.320 and an RMSE of 1.149.

The differences in performance are small, indicating that none of the models significantly outperforms the others on the baseline data.
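The baseline comparison can be tabulated in one place (a sketch assuming the `mse_*` and `rmse_*` objects from the preceding chunks are still in the workspace):

```r
baseline_comparison <- data.frame(
  Model = c("Lasso", "Ridge", "Elastic Net", "Pruned Tree", "Random Forest", "GBM"),
  MSE   = c(mse_lasso, mse_ridge, mse_enet, mse_pruned_tree, mse_rf, mse_gbm),
  RMSE  = c(rmse_lasso, rmse_ridge, rmse_enet, rmse_pruned_tree, rmse_rf, rmse_gbm)
)
baseline_comparison[order(baseline_comparison$RMSE), ]  # best model first
```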

Diet + Covariates

diet_selected <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_break_cer_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_proc_meat_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter"
)

covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")

diet_data <- finalized_data %>% dplyr::select(c(covariates_selected, diet_selected, "hs_zbmi_who"))

set.seed(101)
trainIndex <- createDataPartition(diet_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- diet_data[trainIndex, ]
test_data <- diet_data[-trainIndex, ]

x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who

penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0

LASSO

fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)

coef(fit_lasso)
## 34 x 1 sparse Matrix of class "dgCMatrix"
##                                        s1
## (Intercept)                    0.87326347
## hs_child_age_None             -0.05738585
## e3_sex_Nonefemale              .         
## e3_sex_Nonemale                .         
## e3_yearbir_None2004            .         
## e3_yearbir_None2005            .         
## e3_yearbir_None2006            .         
## e3_yearbir_None2007            .         
## e3_yearbir_None2008            .         
## e3_yearbir_None2009            .         
## h_bfdur_Ter(10.8,34.9]         .         
## h_bfdur_Ter(34.9,Inf]          .         
## hs_bakery_prod_Ter(2,6]        .         
## hs_bakery_prod_Ter(6,Inf]      .         
## hs_break_cer_Ter(1.1,5.5]      .         
## hs_break_cer_Ter(5.5,Inf]      .         
## hs_dairy_Ter(14.6,25.6]        .         
## hs_dairy_Ter(25.6,Inf]         .         
## hs_fastfood_Ter(0.132,0.5]     .         
## hs_fastfood_Ter(0.5,Inf]       .         
## hs_org_food_Ter(0.132,1]       .         
## hs_org_food_Ter(1,Inf]         .         
## hs_proc_meat_Ter(1.5,4]        .         
## hs_proc_meat_Ter(4,Inf]        .         
## hs_total_fish_Ter(1.5,3]       .         
## hs_total_fish_Ter(3,Inf]       .         
## hs_total_fruits_Ter(7,14.1]    .         
## hs_total_fruits_Ter(14.1,Inf]  .         
## hs_total_lipids_Ter(3,7]       .         
## hs_total_lipids_Ter(7,Inf]     .         
## hs_total_sweets_Ter(4.1,8.5]   .         
## hs_total_sweets_Ter(8.5,Inf]   .         
## hs_total_veg_Ter(6,8.5]        .         
## hs_total_veg_Ter(8.5,Inf]      .
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)

cat("Diet + Covariates Lasso MSE:", mse_lasso, "\n")
## Diet + Covariates Lasso MSE: 1.296766
cat("Diet + Covariates Lasso RMSE:", rmse_lasso, "\n")
## Diet + Covariates Lasso RMSE: 1.138756

Ridge

fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)

coef(fit_ridge)
## 34 x 1 sparse Matrix of class "dgCMatrix"
##                                          s1
## (Intercept)                    8.732635e-01
## hs_child_age_None             -5.738585e-02
## e3_sex_Nonefemale             -3.392957e-38
## e3_sex_Nonemale                3.392957e-38
## e3_yearbir_None2004           -1.596853e-37
## e3_yearbir_None2005            2.950807e-38
## e3_yearbir_None2006           -1.018118e-38
## e3_yearbir_None2007           -5.940442e-38
## e3_yearbir_None2008            3.414322e-38
## e3_yearbir_None2009            3.950007e-37
## h_bfdur_Ter(10.8,34.9]         1.959443e-37
## h_bfdur_Ter(34.9,Inf]         -1.196150e-37
## hs_bakery_prod_Ter(2,6]        2.102760e-37
## hs_bakery_prod_Ter(6,Inf]     -1.896677e-37
## hs_break_cer_Ter(1.1,5.5]      9.801144e-39
## hs_break_cer_Ter(5.5,Inf]     -2.286856e-37
## hs_dairy_Ter(14.6,25.6]        5.634165e-38
## hs_dairy_Ter(25.6,Inf]         8.216707e-41
## hs_fastfood_Ter(0.132,0.5]     1.215262e-37
## hs_fastfood_Ter(0.5,Inf]      -6.235230e-38
## hs_org_food_Ter(0.132,1]       8.146218e-38
## hs_org_food_Ter(1,Inf]        -1.900266e-37
## hs_proc_meat_Ter(1.5,4]        1.193171e-37
## hs_proc_meat_Ter(4,Inf]       -1.994408e-38
## hs_total_fish_Ter(1.5,3]      -8.902720e-38
## hs_total_fish_Ter(3,Inf]       2.141752e-38
## hs_total_fruits_Ter(7,14.1]    2.168036e-37
## hs_total_fruits_Ter(14.1,Inf] -2.040600e-37
## hs_total_lipids_Ter(3,7]      -9.170321e-39
## hs_total_lipids_Ter(7,Inf]    -2.153479e-37
## hs_total_sweets_Ter(4.1,8.5]  -7.677763e-38
## hs_total_sweets_Ter(8.5,Inf]  -1.716694e-38
## hs_total_veg_Ter(6,8.5]        2.132182e-38
## hs_total_veg_Ter(8.5,Inf]     -1.282604e-37
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)

cat("Diet + Covariates Ridge MSE:", mse_ridge, "\n")
## Diet + Covariates Ridge MSE: 1.288063
cat("Diet + Covariates Ridge RMSE:", rmse_ridge, "\n")
## Diet + Covariates Ridge RMSE: 1.134929

Elastic Net

fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)

coef(fit_enet)
## 34 x 1 sparse Matrix of class "dgCMatrix"
##                                        s1
## (Intercept)                    0.87326347
## hs_child_age_None             -0.05738585
## e3_sex_Nonefemale              .         
## e3_sex_Nonemale                .         
## e3_yearbir_None2004            .         
## e3_yearbir_None2005            .         
## e3_yearbir_None2006            .         
## e3_yearbir_None2007            .         
## e3_yearbir_None2008            .         
## e3_yearbir_None2009            .         
## h_bfdur_Ter(10.8,34.9]         .         
## h_bfdur_Ter(34.9,Inf]          .         
## hs_bakery_prod_Ter(2,6]        .         
## hs_bakery_prod_Ter(6,Inf]      .         
## hs_break_cer_Ter(1.1,5.5]      .         
## hs_break_cer_Ter(5.5,Inf]      .         
## hs_dairy_Ter(14.6,25.6]        .         
## hs_dairy_Ter(25.6,Inf]         .         
## hs_fastfood_Ter(0.132,0.5]     .         
## hs_fastfood_Ter(0.5,Inf]       .         
## hs_org_food_Ter(0.132,1]       .         
## hs_org_food_Ter(1,Inf]         .         
## hs_proc_meat_Ter(1.5,4]        .         
## hs_proc_meat_Ter(4,Inf]        .         
## hs_total_fish_Ter(1.5,3]       .         
## hs_total_fish_Ter(3,Inf]       .         
## hs_total_fruits_Ter(7,14.1]    .         
## hs_total_fruits_Ter(14.1,Inf]  .         
## hs_total_lipids_Ter(3,7]       .         
## hs_total_lipids_Ter(7,Inf]     .         
## hs_total_sweets_Ter(4.1,8.5]   .         
## hs_total_sweets_Ter(8.5,Inf]   .         
## hs_total_veg_Ter(6,8.5]        .         
## hs_total_veg_Ter(8.5,Inf]      .
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)

cat("Diet + Covariates Elastic Net MSE:", mse_enet, "\n")
## Diet + Covariates Elastic Net MSE: 1.295066
cat("Diet + Covariates Elastic Net RMSE:", rmse_enet, "\n")
## Diet + Covariates Elastic Net RMSE: 1.13801

With the diet variables added, Ridge attains the lowest test MSE (1.288) of the three penalized models, while Lasso and Elastic Net again retain only the child-age term.

Decision Tree

fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")

rpart.plot(fit_tree)

printcp(fit_tree)
## 
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
## 
## Variables actually used in tree construction:
## [1] hs_bakery_prod_Ter  hs_break_cer_Ter    hs_child_age_None  
## [4] hs_total_lipids_Ter
## 
## Root node error: 1328.8/913 = 1.4554
## 
## n= 913 
## 
##         CP nsplit rel error xerror     xstd
## 1 0.011773      0   1.00000 1.0033 0.050928
## 2 0.010484      3   0.96468 1.0157 0.051237
## 3 0.010000      4   0.95420 1.0275 0.053129
plotcp(fit_tree)

optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]

pruned_tree <- prune(fit_tree, cp = optimal_cp)

rpart.plot(pruned_tree)

pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)

cat("Pruned Decision Tree MSE:", mse_pruned_tree, "\n")
## Pruned Decision Tree MSE: 1.319431
cat("Pruned Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE: 1.148665

In the unpruned tree, the top split is on child age (above roughly 6.5 years), followed by splits on bakery products and breakfast cereals.

Random Forest

set.seed(101)
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data)
varImpPlot(fit_rf)

importance(fit_rf)
##                     IncNodePurity
## hs_child_age_None       233.57368
## e3_sex_None              34.19375
## e3_yearbir_None          97.75201
## h_bfdur_Ter              65.45676
## hs_bakery_prod_Ter       64.03655
## hs_break_cer_Ter         65.99478
## hs_dairy_Ter             69.72797
## hs_fastfood_Ter          55.45883
## hs_org_food_Ter          65.89629
## hs_proc_meat_Ter         68.85280
## hs_total_fish_Ter        63.72114
## hs_total_fruits_Ter      64.85756
## hs_total_lipids_Ter      65.79002
## hs_total_sweets_Ter      66.74543
## hs_total_veg_Ter         69.89130
rf_predictions <- predict(fit_rf, newdata = test_data)
mse_rf <- mean((test_data$hs_zbmi_who - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)

cat("Diet + Covariates Random Forest MSE:", mse_rf, "\n")
## Diet + Covariates Random Forest MSE: 1.299369
cat("Diet + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Covariates Random Forest RMSE: 1.139899

The Random Forest model improves slightly when diet variables are included. Child age, year of birth, and total vegetable intake show the highest variable importance.

GBM

set.seed(101)
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 1000, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5, verbose = FALSE)
summary(fit_gbm)
best_trees <- gbm.perf(fit_gbm, method = "cv")

gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_trees)
mse_gbm <- mean((test_data$hs_zbmi_who - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)

cat("Diet + Covariates GBM MSE:", mse_gbm, "\n")
## Diet + Covariates GBM MSE: 1.294869
cat("Diet + Covariates GBM RMSE:", rmse_gbm, "\n")
## Diet + Covariates GBM RMSE: 1.137923

GBM improves modestly with the diet variables (RMSE 1.149 to 1.138), suggesting it can make some use of the additional predictors.

Model Comparison:

For Lasso, Ridge, Elastic Net, and GBM models, the inclusion of diet-related features and other covariates resulted in a slight improvement in MSE and RMSE. This suggests that adding these additional features helps the models to better predict the outcome variable (BMI Z-score).

The Decision Tree model showed almost no change in performance with the addition of diet-related features and other covariates.

The Ridge Regression model had the lowest MSE (1.288) and RMSE (1.135) with the diet and covariates features, indicating it is the best performing model among those compared.

Both Random Forest and GBM models showed a notable improvement with the inclusion of diet-related features and covariates. The Random Forest model’s RMSE decreased from 1.159 to 1.140, and the GBM model’s RMSE decreased from 1.149 to 1.138.
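Since every model above repeats the same MSE/RMSE arithmetic, a small helper keeps the comparisons consistent. This is a generic sketch; `eval_metrics` is a hypothetical name, not a function defined elsewhere in this analysis:

```r
# Compute test-set MSE and RMSE in one call
eval_metrics <- function(actual, predicted) {
  mse <- mean((actual - predicted)^2)
  c(MSE = mse, RMSE = sqrt(mse))
}

# Toy check with made-up values (not model output)
eval_metrics(c(1, 2, 3), c(1.5, 2, 2.5))  # MSE 0.1666667, RMSE 0.4082483
```

For example, `eval_metrics(y_test, ridge_predictions)` would replace the three lines currently repeated for each model.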

Chemicals + Covariates

chemicals_selected <- c(
  "hs_cd_c_Log2", "hs_co_c_Log2", "hs_cs_c_Log2", "hs_cu_c_Log2",
  "hs_hg_c_Log2", "hs_mo_c_Log2", "hs_pb_c_Log2", "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2", "hs_pcb170_cadj_Log2", "hs_dep_cadj_Log2",
  "hs_pbde153_cadj_Log2", "hs_pfhxs_c_Log2", "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2", "hs_prpa_cadj_Log2", "hs_mbzp_cadj_Log2", "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2"
)

chemical_data <- finalized_data %>% dplyr::select(c(covariates_selected, chemicals_selected, "hs_zbmi_who"))

set.seed(101)
trainIndex <- createDataPartition(chemical_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- chemical_data[trainIndex, ]
test_data <- chemical_data[-trainIndex, ]

x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who

penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0
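One caveat worth noting: `model.matrix` appends the factor level to expanded column names (e.g. `e3_sex_None` becomes `e3_sex_Nonefemale`), so matching expanded column names against bare variable names with `%in%` can silently leave factor covariates penalized. A toy illustration with hypothetical data, not the HELIX variables:

```r
df <- data.frame(y = rnorm(6), age = 1:6, sex = factor(rep(c("f", "m"), 3)))
x  <- model.matrix(y ~ . - 1, df)
colnames(x)  # "age" "sexf" "sexm" -- factor columns carry the level suffix

covs <- c("age", "sex")
pf <- rep(1, ncol(x))
pf[colnames(x) %in% covs] <- 0
pf  # 0 1 1 -- only "age" matched; the sex dummies remain penalized
```

Matching with `startsWith(colnames(x), covs_prefix)` per covariate is one way to catch the expanded dummy columns as well.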

LASSO

fit_lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_lasso)

coef(fit_lasso)
## 29 x 1 sparse Matrix of class "dgCMatrix"
##                                s1
## (Intercept)          -1.201642945
## hs_child_age_None    -0.042729967
## e3_sex_Nonefemale     .          
## e3_sex_Nonemale       .          
## e3_yearbir_None2004  -0.040205226
## e3_yearbir_None2005   .          
## e3_yearbir_None2006   .          
## e3_yearbir_None2007   .          
## e3_yearbir_None2008   .          
## e3_yearbir_None2009   .          
## hs_cd_c_Log2         -0.011319810
## hs_co_c_Log2          .          
## hs_cs_c_Log2          .          
## hs_cu_c_Log2          0.260654501
## hs_hg_c_Log2          .          
## hs_mo_c_Log2         -0.069199903
## hs_pb_c_Log2          .          
## hs_dde_cadj_Log2     -0.041339092
## hs_pcb153_cadj_Log2  -0.125818759
## hs_pcb170_cadj_Log2  -0.043143964
## hs_dep_cadj_Log2     -0.003677983
## hs_pbde153_cadj_Log2 -0.036720000
## hs_pfhxs_c_Log2       .          
## hs_pfoa_c_Log2       -0.097979410
## hs_pfos_c_Log2        .          
## hs_prpa_cadj_Log2     .          
## hs_mbzp_cadj_Log2     0.007077421
## hs_mibp_cadj_Log2    -0.025837249
## hs_mnbp_cadj_Log2    -0.004654193
lasso_predictions <- predict(fit_lasso, s = "lambda.min", newx = x_test)
mse_lasso <- mean((y_test - lasso_predictions)^2)
rmse_lasso <- sqrt(mse_lasso)

cat("Chemical + Covariates Lasso MSE:", mse_lasso, "\n")
## Chemical + Covariates Lasso MSE: 1.082759
cat("Chemical + Covariates Lasso RMSE:", rmse_lasso, "\n")
## Chemical + Covariates Lasso RMSE: 1.040557

Ridge

fit_ridge <- cv.glmnet(x_train, y_train, alpha = 0, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_ridge)

coef(fit_ridge)
## 29 x 1 sparse Matrix of class "dgCMatrix"
##                                s1
## (Intercept)          -1.017064370
## hs_child_age_None    -0.050364241
## e3_sex_Nonefemale    -0.016862646
## e3_sex_Nonemale       0.016857713
## e3_yearbir_None2004  -0.088001228
## e3_yearbir_None2005   0.019239251
## e3_yearbir_None2006   0.031700788
## e3_yearbir_None2007  -0.021259447
## e3_yearbir_None2008  -0.006970817
## e3_yearbir_None2009   0.164971168
## hs_cd_c_Log2         -0.028107768
## hs_co_c_Log2         -0.021526809
## hs_cs_c_Log2          0.044750806
## hs_cu_c_Log2          0.241316062
## hs_hg_c_Log2         -0.009375312
## hs_mo_c_Log2         -0.052460551
## hs_pb_c_Log2         -0.025851868
## hs_dde_cadj_Log2     -0.038109739
## hs_pcb153_cadj_Log2  -0.092130426
## hs_pcb170_cadj_Log2  -0.028327506
## hs_dep_cadj_Log2     -0.008871966
## hs_pbde153_cadj_Log2 -0.021002524
## hs_pfhxs_c_Log2      -0.023072685
## hs_pfoa_c_Log2       -0.083415479
## hs_pfos_c_Log2       -0.024508537
## hs_prpa_cadj_Log2     0.001511971
## hs_mbzp_cadj_Log2     0.025157058
## hs_mibp_cadj_Log2    -0.024449722
## hs_mnbp_cadj_Log2    -0.030987317
ridge_predictions <- predict(fit_ridge, s = "lambda.min", newx = x_test)
mse_ridge <- mean((y_test - ridge_predictions)^2)
rmse_ridge <- sqrt(mse_ridge)

cat("Chemical + Covariates Ridge MSE:", mse_ridge, "\n")
## Chemical + Covariates Ridge MSE: 1.072508
cat("Chemical + Covariates Ridge RMSE:", rmse_ridge, "\n")
## Chemical + Covariates Ridge RMSE: 1.035619

Elastic Net

fit_enet <- cv.glmnet(x_train, y_train, alpha = 0.5, family = "gaussian", penalty.factor = penalty_factors)
plot(fit_enet)

coef(fit_enet)
## 29 x 1 sparse Matrix of class "dgCMatrix"
##                                s1
## (Intercept)          -0.667026013
## hs_child_age_None    -0.039495873
## e3_sex_Nonefemale     .          
## e3_sex_Nonemale       .          
## e3_yearbir_None2004   .          
## e3_yearbir_None2005   .          
## e3_yearbir_None2006   .          
## e3_yearbir_None2007   .          
## e3_yearbir_None2008   .          
## e3_yearbir_None2009   .          
## hs_cd_c_Log2         -0.001139167
## hs_co_c_Log2          .          
## hs_cs_c_Log2          .          
## hs_cu_c_Log2          0.189921157
## hs_hg_c_Log2          .          
## hs_mo_c_Log2         -0.050667701
## hs_pb_c_Log2          .          
## hs_dde_cadj_Log2     -0.030158657
## hs_pcb153_cadj_Log2  -0.119215474
## hs_pcb170_cadj_Log2  -0.038236462
## hs_dep_cadj_Log2      .          
## hs_pbde153_cadj_Log2 -0.032755305
## hs_pfhxs_c_Log2       .          
## hs_pfoa_c_Log2       -0.077859529
## hs_pfos_c_Log2        .          
## hs_prpa_cadj_Log2     .          
## hs_mbzp_cadj_Log2     .          
## hs_mibp_cadj_Log2    -0.006808704
## hs_mnbp_cadj_Log2     .
enet_predictions <- predict(fit_enet, s = "lambda.min", newx = x_test)
mse_enet <- mean((y_test - enet_predictions)^2)
rmse_enet <- sqrt(mse_enet)

cat("Chemical + Covariates Elastic Net MSE:", mse_enet, "\n")
## Chemical + Covariates Elastic Net MSE: 1.08233
cat("Chemical + Covariates Elastic Net RMSE:", rmse_enet, "\n")
## Chemical + Covariates Elastic Net RMSE: 1.040351

Decision Tree

fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")

rpart.plot(fit_tree)

printcp(fit_tree)
## 
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
## 
## Variables actually used in tree construction:
## [1] hs_cu_c_Log2         hs_dde_cadj_Log2     hs_mbzp_cadj_Log2   
## [4] hs_mnbp_cadj_Log2    hs_pbde153_cadj_Log2 hs_pcb170_cadj_Log2 
## [7] hs_pfoa_c_Log2      
## 
## Root node error: 1328.8/913 = 1.4554
## 
## n= 913 
## 
##          CP nsplit rel error  xerror     xstd
## 1  0.077388      0   1.00000 1.00330 0.050928
## 2  0.032769      1   0.92261 0.93423 0.048425
## 3  0.032579      2   0.88984 0.92796 0.047448
## 4  0.025008      3   0.85726 0.90155 0.045688
## 5  0.014062      4   0.83226 0.89664 0.046321
## 6  0.013674      6   0.80413 0.90037 0.048102
## 7  0.013602      7   0.79046 0.89910 0.048133
## 8  0.013517      8   0.77686 0.89738 0.047637
## 9  0.011314      9   0.76334 0.89439 0.046507
## 10 0.011047     11   0.74071 0.89261 0.046811
## 11 0.010329     12   0.72967 0.89860 0.047033
## 12 0.010000     14   0.70901 0.90899 0.047783
plotcp(fit_tree)

optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]

pruned_tree <- prune(fit_tree, cp = optimal_cp)

rpart.plot(pruned_tree)

pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)

cat("Pruned Decision Tree MSE:", mse_pruned_tree, "\n")
## Pruned Decision Tree MSE: 1.210403
cat("Pruned Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE: 1.100183

The pruned decision tree diagram illustrates how the model makes splits based on chemical exposure variables like hs_pcb170_cadj_Log2, hs_pbde153_cadj_Log2, hs_dde_cadj_Log2, and hs_mnbp_cadj_Log2.

The first split is on hs_pcb170_cadj_Log2 with a threshold of 1.2, indicating that this chemical exposure level is a critical factor in predicting BMI Z-score.

Subsequent splits further refine the predictions based on other chemical exposures, highlighting the importance of these factors in the model.
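The cp chosen above minimizes cross-validated error (xerror); a common, more conservative alternative is the 1-SE rule, which takes the simplest tree whose xerror is within one standard error of the minimum. A sketch on a toy cptable-like matrix (illustrative numbers, not the fitted table):

```r
cptable <- cbind(CP     = c(0.077, 0.033, 0.025, 0.011),
                 xerror = c(1.003, 0.928, 0.902, 0.893),
                 xstd   = c(0.051, 0.047, 0.046, 0.047))

i_min     <- which.min(cptable[, "xerror"])
threshold <- cptable[i_min, "xerror"] + cptable[i_min, "xstd"]

# 1-SE rule: largest CP (simplest tree) whose xerror is under the threshold
cp_1se <- max(cptable[cptable[, "xerror"] <= threshold, "CP"])
cp_1se  # 0.033
```

The 1-SE rule typically yields a smaller, more interpretable tree at a small cost in cross-validated error.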

Random Forest

fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500)
rf_predictions <- predict(fit_rf, newdata = test_data)
varImpPlot(fit_rf)

importance(fit_rf)
##                      IncNodePurity
## hs_child_age_None        47.187198
## e3_sex_None               5.878084
## e3_yearbir_None          35.112485
## hs_cd_c_Log2             51.795652
## hs_co_c_Log2             51.717642
## hs_cs_c_Log2             40.518806
## hs_cu_c_Log2             68.458209
## hs_hg_c_Log2             51.566383
## hs_mo_c_Log2             60.691194
## hs_pb_c_Log2             48.070852
## hs_dde_cadj_Log2         78.539452
## hs_pcb153_cadj_Log2      85.795111
## hs_pcb170_cadj_Log2     132.855527
## hs_dep_cadj_Log2         55.605309
## hs_pbde153_cadj_Log2     98.811874
## hs_pfhxs_c_Log2          45.792625
## hs_pfoa_c_Log2           61.349673
## hs_pfos_c_Log2           51.462842
## hs_prpa_cadj_Log2        51.557275
## hs_mbzp_cadj_Log2        57.029739
## hs_mibp_cadj_Log2        45.097714
## hs_mnbp_cadj_Log2        52.899049
mse_rf <- mean((y_test - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)

cat("Chemical + Covariates Random Forest MSE:", mse_rf, "\n")
## Chemical + Covariates Random Forest MSE: 1.062342
cat("Chemical + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Chemical + Covariates Random Forest RMSE: 1.0307

Features such as hs_pcb170_cadj_Log2, hs_pbde153_cadj_Log2, and hs_dde_cadj_Log2 are among the top predictors by node purity, indicating that these chemical exposures contribute substantially to the model’s predictions.

GBM

fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 100, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5)
summary(fit_gbm)
best_iter <- gbm.perf(fit_gbm, method = "cv")

gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_iter)
mse_gbm <- mean((y_test - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)

cat("Chemical + Covariates GBM MSE:", mse_gbm, "\n")
## Chemical + Covariates GBM MSE: 1.150933
cat("Chemical + Covariates GBM RMSE:", rmse_gbm, "\n")
## Chemical + Covariates GBM RMSE: 1.072815

The reduction in MSE and RMSE relative to the diet-based models suggests that chemical exposures carry useful signal for predicting BMI Z-score. Note, however, that this GBM was fit with only 100 trees (versus 1,000 for the diet model), so its performance here may be understated.

Diet + Chemical + Covariates

covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")
diet_selected <- c(
  "h_bfdur_Ter", "hs_bakery_prod_Ter", "hs_break_cer_Ter", "hs_dairy_Ter",
  "hs_fastfood_Ter", "hs_org_food_Ter", "hs_proc_meat_Ter", "hs_total_fish_Ter",
  "hs_total_fruits_Ter", "hs_total_lipids_Ter", "hs_total_sweets_Ter", "hs_total_veg_Ter"
)
chemicals_selected <- c(
  "hs_cd_c_Log2", "hs_co_c_Log2", "hs_cs_c_Log2", "hs_cu_c_Log2",
  "hs_hg_c_Log2", "hs_mo_c_Log2", "hs_pb_c_Log2", "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2", "hs_pcb170_cadj_Log2", "hs_dep_cadj_Log2",
  "hs_pbde153_cadj_Log2", "hs_pfhxs_c_Log2", "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2", "hs_prpa_cadj_Log2", "hs_mbzp_cadj_Log2", "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2"
)

combined_selected <- c(covariates_selected, diet_selected, chemicals_selected)

chemical_diet_cov_data <- finalized_data %>% dplyr::select(all_of(c(combined_selected, "hs_zbmi_who")))

set.seed(101)
trainIndex <- createDataPartition(chemical_diet_cov_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
train_data <- chemical_diet_cov_data[trainIndex,]
test_data <- chemical_diet_cov_data[-trainIndex,]

x_train <- model.matrix(hs_zbmi_who ~ . -1, train_data)
y_train <- train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, test_data)
y_test <- test_data$hs_zbmi_who

penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0

num_covariates <- length(covariates_selected)
num_diet <- length(diet_selected)
num_chemicals <- length(chemicals_selected)

# make the group_indices vector
group_indices <- c(
  rep(1, num_covariates),     # Group 1: Covariates
  rep(2, num_diet),           # Group 2: Diet
  rep(3, num_chemicals)       # Group 3: Chemicals
)

# pad if needed -- caution: model.matrix expands each factor (sex, year of
# birth, and the tertile diet variables) into multiple dummy columns, so the
# per-variable counts above no longer line up column-for-column with x_train.
# Tail-padding with a catch-all group (4) fixes the length but can leave the
# earlier groups misaligned; a per-column mapping (e.g. via
# attr(x_train, "assign")) is more reliable.
if (length(group_indices) < ncol(x_train)) {
  group_indices <- c(group_indices, rep(4, ncol(x_train) - length(group_indices)))
}

length(group_indices) == ncol(x_train)  # should be TRUE
## [1] TRUE
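The length check above only guards against a count mismatch; it does not guarantee each column landed in the right group once factors expand into dummies. `model.matrix`'s `"assign"` attribute maps every column (dummies included) back to its term, which yields a per-column grouping directly. A sketch with hypothetical data:

```r
df <- data.frame(y = rnorm(6), age = 1:6,
                 sex = factor(rep(c("f", "m"), 3)), chem = rnorm(6))
x <- model.matrix(y ~ . - 1, df)

term_of_col <- attr(x, "assign")  # term index for every column
term_of_col  # 1 2 2 3 -- both sex dummies map back to term 2

# Map terms to analysis groups (1 = covariates, 2 = chemicals) by term name
term_labels   <- attr(terms(y ~ ., data = df), "term.labels")  # "age" "sex" "chem"
group_of_term <- ifelse(term_labels %in% c("age", "sex"), 1, 2)
group_indices <- group_of_term[term_of_col]
group_indices  # 1 1 1 2
```

Because the mapping is derived from the design matrix itself, it stays correct no matter how many dummy columns each factor produces.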

Group LASSO

set.seed(101)
# Note: grpreg does not take glmnet's `penalty.factor` argument (per-group
# penalty weights are set via `group.multiplier`), so it is likely ignored here.
group_lasso_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grLasso", penalty.factor = penalty_factors, family = "gaussian")

group_lasso_predictions <- predict(group_lasso_model, x_test, type = "response")

mse_group_lasso <- mean((y_test - group_lasso_predictions)^2)
rmse_group_lasso <- sqrt(mse_group_lasso)

cat("Group Lasso MSE:", mse_group_lasso, "\n")
## Group Lasso MSE: 1.086867
cat("Group Lasso RMSE:", rmse_group_lasso, "\n")
## Group Lasso RMSE: 1.042529

Group MCP

set.seed(101)
group_mcp_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grMCP", penalty.factor = penalty_factors, family = "gaussian")

group_mcp_predictions <- predict(group_mcp_model, x_test, type = "response")

mse_group_mcp <- mean((y_test - group_mcp_predictions)^2)
rmse_group_mcp <- sqrt(mse_group_mcp)

cat("Group MCP MSE:", mse_group_mcp, "\n")
## Group MCP MSE: 1.085814
cat("Group MCP RMSE:", rmse_group_mcp, "\n")
## Group MCP RMSE: 1.042024

Group SCAD

set.seed(101)
group_scad_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grSCAD", penalty.factor = penalty_factors, family = "gaussian")

group_scad_predictions <- predict(group_scad_model, x_test, type = "response")

mse_group_scad <- mean((y_test - group_scad_predictions)^2)
rmse_group_scad <- sqrt(mse_group_scad)

cat("Group SCAD MSE:", mse_group_scad, "\n")
## Group SCAD MSE: 1.088671
cat("Group SCAD RMSE:", rmse_group_scad, "\n")
## Group SCAD RMSE: 1.043394

Decision Tree

set.seed(101)
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")

rpart.plot(fit_tree)

printcp(fit_tree)
## 
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
## 
## Variables actually used in tree construction:
## [1] hs_cu_c_Log2         hs_dde_cadj_Log2     hs_fastfood_Ter     
## [4] hs_mnbp_cadj_Log2    hs_pbde153_cadj_Log2 hs_pcb170_cadj_Log2 
## [7] hs_pfoa_c_Log2      
## 
## Root node error: 1328.8/913 = 1.4554
## 
## n= 913 
## 
##          CP nsplit rel error  xerror     xstd
## 1  0.077388      0   1.00000 1.00089 0.050861
## 2  0.032769      1   0.92261 0.93652 0.048377
## 3  0.032579      2   0.88984 0.91682 0.046750
## 4  0.025008      3   0.85726 0.91218 0.046570
## 5  0.015671      4   0.83226 0.90443 0.045712
## 6  0.013674      5   0.81658 0.91485 0.047377
## 7  0.013602      6   0.80291 0.91478 0.047609
## 8  0.013517      7   0.78931 0.91478 0.047609
## 9  0.011314      8   0.77579 0.93397 0.049086
## 10 0.011047     10   0.75317 0.96270 0.050245
## 11 0.010296     11   0.74212 0.98215 0.050293
## 12 0.010000     12   0.73182 0.98607 0.050369
plotcp(fit_tree)

optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]

pruned_tree <- prune(fit_tree, cp = optimal_cp)

rpart.plot(pruned_tree)

pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)

cat("Pruned Decision Tree MSE:", mse_pruned_tree, "\n")
## Pruned Decision Tree MSE: 1.187546
cat("Pruned Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE: 1.089746

The pruned decision tree diagram illustrates how the model splits based on chemical exposure variables like hs_pcb170_cadj_Log2, hs_pbde153_cadj_Log2, hs_dde_cadj_Log2, and diet-related variables.

Key splits in the tree emphasize the importance of these exposures and dietary factors in predicting BMI Z-score, with the first split on hs_pcb170_cadj_Log2 at 1.2 being particularly important.

Random Forest

set.seed(101)
fit_rf <- randomForest(hs_zbmi_who ~ ., data = train_data, ntree = 500)
varImpPlot(fit_rf)

importance(fit_rf)
##                      IncNodePurity
## hs_child_age_None        41.267546
## e3_sex_None               5.540866
## e3_yearbir_None          31.564485
## h_bfdur_Ter              13.136212
## hs_bakery_prod_Ter       20.689578
## hs_break_cer_Ter         13.679182
## hs_dairy_Ter             11.181729
## hs_fastfood_Ter          11.895794
## hs_org_food_Ter          10.856862
## hs_proc_meat_Ter         10.827847
## hs_total_fish_Ter        10.984827
## hs_total_fruits_Ter      13.977047
## hs_total_lipids_Ter      12.980031
## hs_total_sweets_Ter      10.749403
## hs_total_veg_Ter         12.258026
## hs_cd_c_Log2             44.886048
## hs_co_c_Log2             45.054544
## hs_cs_c_Log2             34.878871
## hs_cu_c_Log2             60.387604
## hs_hg_c_Log2             43.835480
## hs_mo_c_Log2             50.597285
## hs_pb_c_Log2             39.126962
## hs_dde_cadj_Log2         71.115165
## hs_pcb153_cadj_Log2      74.294671
## hs_pcb170_cadj_Log2     130.998985
## hs_dep_cadj_Log2         50.333164
## hs_pbde153_cadj_Log2     92.865989
## hs_pfhxs_c_Log2          38.593526
## hs_pfoa_c_Log2           53.392093
## hs_pfos_c_Log2           44.360304
## hs_prpa_cadj_Log2        43.848750
## hs_mbzp_cadj_Log2        49.661827
## hs_mibp_cadj_Log2        38.399318
## hs_mnbp_cadj_Log2        45.955350
rf_predictions <- predict(fit_rf, newdata = test_data)

mse_rf <- mean((y_test - rf_predictions)^2)
rmse_rf <- sqrt(mse_rf)

cat("Diet + Chemical + Covariates Random Forest MSE:", mse_rf, "\n")
## Diet + Chemical + Covariates Random Forest MSE: 1.050314
cat("Diet + Chemical + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Chemical + Covariates Random Forest RMSE: 1.024848

Features such as hs_pcb170_cadj_Log2, hs_pbde153_cadj_Log2, hs_dde_cadj_Log2, and hs_mnbp_cadj_Log2 are among the top predictors, indicating significant contributions from chemical exposures.

GBM

set.seed(101)
fit_gbm <- gbm(hs_zbmi_who ~ ., data = train_data, distribution = "gaussian", n.trees = 100, interaction.depth = 3, shrinkage = 0.01, cv.folds = 5)
summary(fit_gbm)
best_iter <- gbm.perf(fit_gbm, method = "cv")

gbm_predictions <- predict(fit_gbm, newdata = test_data, n.trees = best_iter)
mse_gbm <- mean((y_test - gbm_predictions)^2)
rmse_gbm <- sqrt(mse_gbm)

cat("Diet + Chemical + Covariates GBM MSE:", mse_gbm, "\n")
## Diet + Chemical + Covariates GBM MSE: 1.15011
cat("Diet + Chemical + Covariates GBM RMSE:", rmse_gbm, "\n")
## Diet + Chemical + Covariates GBM RMSE: 1.072432

The diet + chemical GBM performs essentially the same as the chemicals-only GBM (RMSE 1.072 in both), suggesting that the diet variables add little predictive information once chemical exposures and covariates are included.

Diet + Chemical + Metabolomic + Covariates

For the metabolomic data, incomplete observations were excluded rather than imputed, since too many entries could not reasonably be filled in with a mean or median.

load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/metabol_serum.RData")
metabol_serum_transposed <- as.data.frame(t(metabol_serum.d))
metabol_serum_transposed$ID <- as.integer(rownames(metabol_serum_transposed))

# add the ID column to the first position
metabol_serum_transposed <- metabol_serum_transposed[, c("ID", setdiff(names(metabol_serum_transposed), "ID"))]

# ID is the first column, and the layout is preserved
kable(head(metabol_serum_transposed), align = "c", digits = 2, format = "pipe")
(Table: first six rows of metabol_serum_transposed — the ID column followed by 177 serum metabolite features, metab_1 through metab_177.)
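The transpose-and-relabel step above can be verified on a toy matrix: `t()` turns metabolites-in-rows into metabolites-in-columns, and the subject IDs stored as rownames become a proper first column (hypothetical values, not the HELIX data):

```r
mat <- matrix(1:6, nrow = 2,
              dimnames = list(c("metab_1", "metab_2"), c("101", "102", "103")))

d <- as.data.frame(t(mat))       # subjects become rows
d$ID <- as.integer(rownames(d))  # rownames ("101", ...) -> integer ID column
d <- d[, c("ID", setdiff(names(d), "ID"))]
names(d)  # "ID" "metab_1" "metab_2"
```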
# specific covariates
load("/Users/allison/Library/CloudStorage/GoogleDrive-aflouie@usc.edu/My Drive/HELIX_data/HELIX.RData")
filtered_chem_diet <- codebook %>%
  filter(domain %in% c("Chemicals", "Lifestyles") & period == "Postnatal" & subfamily != "Allergens")

# specific covariates
filtered_covariates <- codebook %>%
  filter(domain == "Covariates" & 
         variable_name %in% c("ID", "e3_sex_None", "e3_yearbir_None", "hs_child_age_None"))

# specific phenotype variables
filtered_phenotype <- codebook %>%
  filter(domain == "Phenotype" & 
         variable_name %in% c("hs_zbmi_who"))

# combining all necessary variables together
combined_codebook <- bind_rows(filtered_chem_diet, filtered_covariates, filtered_phenotype)

outcome_and_cov <- cbind(covariates, outcome_BMI)
outcome_and_cov <- outcome_and_cov[, !duplicated(colnames(outcome_and_cov))]
outcome_and_cov <- outcome_and_cov %>%
  dplyr::select(ID, hs_child_age_None, e3_sex_None, e3_yearbir_None, hs_zbmi_who)
# the full chemicals list
chemicals_specific <- c(
  "hs_cd_c_Log2",
  "hs_co_c_Log2",
  "hs_cs_c_Log2",
  "hs_cu_c_Log2",
  "hs_hg_c_Log2",
  "hs_mo_c_Log2",
  "hs_pb_c_Log2",
  "hs_dde_cadj_Log2",
  "hs_pcb153_cadj_Log2",
  "hs_pcb170_cadj_Log2",
  "hs_dep_cadj_Log2",
  "hs_pbde153_cadj_Log2",
  "hs_pfhxs_c_Log2",
  "hs_pfoa_c_Log2",
  "hs_pfos_c_Log2",
  "hs_prpa_cadj_Log2",
  "hs_mbzp_cadj_Log2",
  "hs_mibp_cadj_Log2",
  "hs_mnbp_cadj_Log2"
)

# postnatal diet for child
postnatal_diet <- c(
  "h_bfdur_Ter",
  "hs_bakery_prod_Ter",
  "hs_dairy_Ter",
  "hs_fastfood_Ter",
  "hs_org_food_Ter",
  "hs_readymade_Ter",
  "hs_total_bread_Ter",
  "hs_total_fish_Ter",
  "hs_total_fruits_Ter",
  "hs_total_lipids_Ter",
  "hs_total_potatoes_Ter",
  "hs_total_sweets_Ter",
  "hs_total_veg_Ter"
)

covariates_selected <- c("hs_child_age_None", "e3_sex_None", "e3_yearbir_None")

all_columns <- c(chemicals_specific, postnatal_diet)
extracted_exposome <- exposome %>% dplyr::select(all_of(all_columns))

selected_id_data <- cbind(outcome_and_cov, extracted_exposome)

# ID is the common identifier in both datasets
combined_data <- merge(selected_id_data, metabol_serum_transposed, by = "ID", all = TRUE)

selected_metabolomics_data <- combined_data %>% dplyr::select(-c(ID))
head(selected_metabolomics_data)
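Because `merge()` is called with `all = TRUE` (a full outer join), subjects present in only one of the two datasets survive the merge with NA on the other side; these are the rows `na.omit()` removes in the next step. A toy illustration with hypothetical IDs:

```r
a <- data.frame(ID = c(1, 2, 3), zbmi    = c(0.1, -0.4, 1.2))
b <- data.frame(ID = c(2, 3, 4), metab_1 = c(5.0, 4.2, 6.1))

m <- merge(a, b, by = "ID", all = TRUE)  # full outer join keeps IDs 1 through 4
nrow(m)           # 4
sum(is.na(m))     # 2 -- metab_1 missing for ID 1, zbmi missing for ID 4
nrow(na.omit(m))  # 2 -- only IDs 2 and 3 are complete
```

Using `all = FALSE` (an inner join) would keep only the complete IDs up front and make the later `na.omit()` unnecessary for merge-induced missingness.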

Group LASSO

selected_metabolomics_data <- selected_metabolomics_data %>% na.omit()

set.seed(101)
trainIndex <- createDataPartition(selected_metabolomics_data$hs_zbmi_who, p = .7, list = FALSE, times = 1)
metabol_train_data <- selected_metabolomics_data[trainIndex,]
metabol_test_data  <- selected_metabolomics_data[-trainIndex,]

x_train <- model.matrix(hs_zbmi_who ~ . -1, metabol_train_data)
y_train <- metabol_train_data$hs_zbmi_who
x_test <- model.matrix(hs_zbmi_who ~ . -1, metabol_test_data)
y_test <- metabol_test_data$hs_zbmi_who

penalty_factors <- rep(1, ncol(x_train))
penalty_factors[colnames(x_train) %in% covariates_selected] <- 0

num_covariates <- length(covariates_selected)
num_diet <- length(diet_selected)
num_chemicals <- length(chemicals_selected)
num_metabolomics <- ncol(metabol_serum_transposed) - 1  # subtract 1 for the ID column

group_indices <- c(
  rep(1, num_covariates),     # Group 1: Covariates
  rep(2, num_diet),           # Group 2: Diet
  rep(3, num_chemicals),      # Group 3: Chemicals
  rep(4, num_metabolomics)    # Group 4: Metabolomics
)

# pad if needed -- caution: factor expansion in model.matrix means these
# per-variable counts may not line up column-for-column with x_train, and
# tail-padding with group 5 can leave the earlier groups misaligned
if (length(group_indices) < ncol(x_train)) {
  group_indices <- c(group_indices, rep(5, ncol(x_train) - length(group_indices)))
}

length(group_indices) == ncol(x_train)  # This should be TRUE
## [1] TRUE
# Fit group lasso model
set.seed(101)
group_lasso_model <- grpreg(x_train, y_train, group = group_indices, penalty = "grLasso", penalty.factor = penalty_factors, family = "gaussian")

group_lasso_predictions <- predict(group_lasso_model, x_test, type = "response")

mse_group_lasso <- mean((y_test - group_lasso_predictions)^2)
rmse_group_lasso <- sqrt(mse_group_lasso)

cat("Group Lasso Test MSE:", mse_group_lasso, "\n")
## Group Lasso Test MSE: 0.8713226
cat("Group Lasso Test RMSE:", rmse_group_lasso, "\n")
## Group Lasso Test RMSE: 0.9334466

Group MCP

Note: grpreg has no group-ridge penalty; "grMCP" fits the group minimax concave penalty (MCP), a nonconvex relative of the group lasso, so the model is labeled accordingly here.

set.seed(101)
group_mcp_model <- cv.grpreg(x_train, y_train, group = group_indices, penalty = "grMCP", group.multiplier = as.numeric(sort(unique(group_indices)) != 1), family = "gaussian")
group_mcp_predictions <- predict(group_mcp_model, x_test, type = "response")
mse_group_mcp <- mean((y_test - group_mcp_predictions)^2)
rmse_group_mcp <- sqrt(mse_group_mcp)
cat("Group MCP MSE:", mse_group_mcp, "\n")
## Group MCP MSE: 0.8883061
cat("Group MCP RMSE:", rmse_group_mcp, "\n")
## Group MCP RMSE: 0.9424999

Group SCAD

Similarly, "grSCAD" fits the group SCAD penalty rather than an elastic net, and the labels below reflect that.

set.seed(101)
group_scad_model <- cv.grpreg(x_train, y_train, group = group_indices, penalty = "grSCAD", group.multiplier = as.numeric(sort(unique(group_indices)) != 1), family = "gaussian")
group_scad_predictions <- predict(group_scad_model, x_test, type = "response")
mse_group_scad <- mean((y_test - group_scad_predictions)^2)
rmse_group_scad <- sqrt(mse_group_scad)
cat("Group SCAD MSE:", mse_group_scad, "\n")
## Group SCAD MSE: 0.8912767
cat("Group SCAD RMSE:", rmse_group_scad, "\n")
## Group SCAD RMSE: 0.9440745

Decision Trees

selected_metabolomics_data <- selected_metabolomics_data %>% na.omit()

set.seed(101)
trainIndex <- createDataPartition(selected_metabolomics_data$hs_zbmi_who, p = .7, 
                                  list = FALSE, 
                                  times = 1)
train_data <- selected_metabolomics_data[ trainIndex,]
test_data  <- selected_metabolomics_data[-trainIndex,]

# The tree-based models below all use the formula interface on the data
# frames directly, so only the response vectors are extracted here.
y_train <- train_data$hs_zbmi_who
y_test <- test_data$hs_zbmi_who

set.seed(101)
fit_tree <- rpart(hs_zbmi_who ~ ., data = train_data, method = "anova")

rpart.plot(fit_tree)

printcp(fit_tree)
## 
## Regression tree:
## rpart(formula = hs_zbmi_who ~ ., data = train_data, method = "anova")
## 
## Variables actually used in tree construction:
##  [1] hs_cu_c_Log2        hs_dde_cadj_Log2    hs_pcb170_cadj_Log2
##  [4] metab_100           metab_129           metab_140          
##  [7] metab_141           metab_142           metab_157          
## [10] metab_161           metab_176           metab_47           
## [13] metab_58            metab_8             metab_94           
## [16] metab_95           
## 
## Root node error: 1254/841 = 1.4911
## 
## n= 841 
## 
##          CP nsplit rel error  xerror     xstd
## 1  0.089116      0   1.00000 1.00275 0.052776
## 2  0.069078      1   0.91088 0.94590 0.050339
## 3  0.037249      2   0.84181 0.91981 0.049729
## 4  0.034324      3   0.80456 0.90111 0.048711
## 5  0.025576      4   0.77023 0.89055 0.049604
## 6  0.021936      5   0.74466 0.90166 0.048993
## 7  0.021276      6   0.72272 0.90895 0.049515
## 8  0.020244      8   0.68017 0.90807 0.050310
## 9  0.016938      9   0.65992 0.91277 0.050356
## 10 0.015553     10   0.64299 0.90306 0.050272
## 11 0.015178     11   0.62743 0.90122 0.050368
## 12 0.014756     12   0.61226 0.90066 0.050300
## 13 0.014022     13   0.59750 0.91203 0.051412
## 14 0.013859     14   0.58348 0.92227 0.051732
## 15 0.012010     15   0.56962 0.93488 0.053622
## 16 0.011088     16   0.55761 0.93327 0.053609
## 17 0.011027     17   0.54652 0.92988 0.053560
## 18 0.010000     18   0.53549 0.93467 0.053891
plotcp(fit_tree)

optimal_cp <- fit_tree$cptable[which.min(fit_tree$cptable[,"xerror"]), "CP"]

pruned_tree <- prune(fit_tree, cp = optimal_cp)

rpart.plot(pruned_tree)

pruned_tree_predictions <- predict(pruned_tree, newdata = test_data)
mse_pruned_tree <- mean((test_data$hs_zbmi_who - pruned_tree_predictions)^2)
rmse_pruned_tree <- sqrt(mse_pruned_tree)

cat("Pruned Decision Tree MSE:", mse_pruned_tree, "\n")
## Pruned Decision Tree MSE: 1.272858
cat("Pruned Decision Tree RMSE:", rmse_pruned_tree, "\n")
## Pruned Decision Tree RMSE: 1.12821
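A common alternative to pruning at which.min(xerror) is the one-standard-error rule: take the smallest tree whose cross-validated error is within one standard error of the minimum, trading a little accuracy for a simpler tree. A minimal base-R sketch on a toy cp table with made-up values (the same three selection lines apply directly to fit_tree$cptable):

```r
# Toy cp table in the shape rpart produces (CP, xerror, xstd columns).
cp_tab <- cbind(
  CP     = c(0.089, 0.069, 0.037, 0.026, 0.010),
  xerror = c(1.003, 0.946, 0.920, 0.891, 0.935),
  xstd   = c(0.053, 0.050, 0.050, 0.050, 0.054)
)

i_min     <- which.min(cp_tab[, "xerror"])             # row of the CV minimum
threshold <- cp_tab[i_min, "xerror"] + cp_tab[i_min, "xstd"]
# Smallest tree (first row) whose xerror falls under the threshold:
cp_1se <- cp_tab[which(cp_tab[, "xerror"] <= threshold)[1], "CP"]
cp_1se
# [1] 0.037
```

With the real cptable, prune(fit_tree, cp = cp_1se) would typically yield a noticeably smaller tree than pruning at the exact minimum.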

Random Forest

set.seed(101)
rf_model <- randomForest(hs_zbmi_who ~ . , data = train_data, ntree = 500)
importance(rf_model)
##                       IncNodePurity
## hs_child_age_None         4.7927709
## e3_sex_None               0.5707810
## e3_yearbir_None           5.3687139
## hs_cd_c_Log2              4.8057550
## hs_co_c_Log2              5.5171622
## hs_cs_c_Log2              5.0356667
## hs_cu_c_Log2             12.4603237
## hs_hg_c_Log2              6.5609155
## hs_mo_c_Log2              9.6711331
## hs_pb_c_Log2              6.0711093
## hs_dde_cadj_Log2         13.2357957
## hs_pcb153_cadj_Log2      42.5218764
## hs_pcb170_cadj_Log2      78.6536355
## hs_dep_cadj_Log2          5.7451267
## hs_pbde153_cadj_Log2     28.0252092
## hs_pfhxs_c_Log2           5.4200119
## hs_pfoa_c_Log2            9.0245854
## hs_pfos_c_Log2            6.9947002
## hs_prpa_cadj_Log2         5.0401167
## hs_mbzp_cadj_Log2         4.4169868
## hs_mibp_cadj_Log2         4.0028339
## hs_mnbp_cadj_Log2         5.0996136
## h_bfdur_Ter               2.8968599
## hs_bakery_prod_Ter        2.9013737
## hs_dairy_Ter              1.2216138
## hs_fastfood_Ter           0.9580573
## hs_org_food_Ter           1.1570809
## hs_readymade_Ter          1.4572844
## hs_total_bread_Ter        1.1592172
## hs_total_fish_Ter         1.4961459
## hs_total_fruits_Ter       1.1767595
## hs_total_lipids_Ter       1.3061318
## hs_total_potatoes_Ter     1.2547008
## hs_total_sweets_Ter       0.9540810
## hs_total_veg_Ter          1.0142937
## metab_1                   4.5011841
## metab_2                   4.1531041
## metab_3                   3.4379265
## metab_4                   5.7831834
## metab_5                   2.9178829
## metab_6                   9.2084145
## metab_7                   4.5028527
## metab_8                  36.1613265
## metab_9                   2.8921966
## metab_10                  3.3253911
## metab_11                  4.0194203
## metab_12                  3.3756205
## metab_13                  4.1227190
## metab_14                  5.2837840
## metab_15                  4.4314031
## metab_16                  2.6462109
## metab_17                  2.4025892
## metab_18                  3.0563512
## metab_19                  2.2951339
## metab_20                  3.9067077
## metab_21                  2.8607845
## metab_22                  2.6869075
## metab_23                  2.8972758
## metab_24                  3.7219939
## metab_25                  3.4797174
## metab_26                  7.4516177
## metab_27                  3.3263961
## metab_28                  3.6858414
## metab_29                  3.3029111
## metab_30                 18.5312981
## metab_31                  3.4436727
## metab_32                  3.0545040
## metab_33                  5.2038360
## metab_34                  2.4118531
## metab_35                  7.7143364
## metab_36                  3.8489845
## metab_37                  3.5293186
## metab_38                  2.5404708
## metab_39                  2.5165610
## metab_40                  5.1546331
## metab_41                  3.8921235
## metab_42                  6.1284140
## metab_43                  3.4008043
## metab_44                  3.5698540
## metab_45                  3.6758543
## metab_46                  5.1802108
## metab_47                  6.2120208
## metab_48                 11.7806465
## metab_49                 33.0932505
## metab_50                  9.9343588
## metab_51                  6.2231737
## metab_52                  3.3042007
## metab_53                  5.1286048
## metab_54                  4.9482602
## metab_55                  7.9980726
## metab_56                  4.6904911
## metab_57                  4.7919533
## metab_58                  3.2769554
## metab_59                  5.7319035
## metab_60                  3.9192335
## metab_61                  3.0283668
## metab_62                  4.3267745
## metab_63                  4.8698592
## metab_64                  3.9307501
## metab_65                  3.8785658
## metab_66                  2.5032525
## metab_67                  2.7330496
## metab_68                  3.5029673
## metab_69                  2.4526201
## metab_70                  3.2796735
## metab_71                  4.6222410
## metab_72                  3.4044271
## metab_73                  3.0460977
## metab_74                  2.7018733
## metab_75                  3.8889307
## metab_76                  2.8113720
## metab_77                  4.7832198
## metab_78                  4.9096271
## metab_79                  4.0912870
## metab_80                  3.5334220
## metab_81                  3.3897748
## metab_82                  4.6950398
## metab_83                  3.3638699
## metab_84                  3.0885189
## metab_85                  4.6423208
## metab_86                  3.2104952
## metab_87                  3.8282082
## metab_88                  2.8852207
## metab_89                  2.7940235
## metab_90                  2.6775937
## metab_91                  3.2999703
## metab_92                  3.4644294
## metab_93                  2.8208312
## metab_94                  8.2011811
## metab_95                 51.6744901
## metab_96                  8.1118854
## metab_97                  3.0816526
## metab_98                  3.2655000
## metab_99                  4.4506345
## metab_100                 3.9391564
## metab_101                 3.6374127
## metab_102                 5.7070910
## metab_103                 4.1270357
## metab_104                 5.6319496
## metab_105                 3.3000728
## metab_106                 3.4540736
## metab_107                 3.8068032
## metab_108                 4.0689264
## metab_109                 5.3846332
## metab_110                 7.8695016
## metab_111                 2.7218197
## metab_112                 2.7613608
## metab_113                 5.5357192
## metab_114                 4.3740721
## metab_115                 4.3476546
## metab_116                 6.1735729
## metab_117                 7.0544259
## metab_118                 2.4601605
## metab_119                 5.4450427
## metab_120                 7.1242577
## metab_121                 3.7051871
## metab_122                 6.2282665
## metab_123                 3.2628782
## metab_124                 3.3382452
## metab_125                 3.4201616
## metab_126                 2.4057902
## metab_127                 5.0861475
## metab_128                 6.4178268
## metab_129                 4.2203870
## metab_130                 3.4512521
## metab_131                 3.2548319
## metab_132                 3.2654363
## metab_133                 2.6708217
## metab_134                 3.2328840
## metab_135                 5.2587159
## metab_136                 5.0406508
## metab_137                 6.3400908
## metab_138                 4.7455979
## metab_139                 3.7303384
## metab_140                 2.7325221
## metab_141                 7.9353350
## metab_142                13.5320442
## metab_143                 9.6865623
## metab_144                 3.9033383
## metab_145                 4.2785181
## metab_146                 4.2770667
## metab_147                 3.5426166
## metab_148                 3.3025573
## metab_149                 5.6260285
## metab_150                 4.9789354
## metab_151                 3.5939464
## metab_152                 4.7580724
## metab_153                 4.3991465
## metab_154                 5.0953881
## metab_155                 2.5069093
## metab_156                 2.8819060
## metab_157                 4.3767383
## metab_158                 3.9225685
## metab_159                 3.3643529
## metab_160                 7.0114089
## metab_161                28.1188870
## metab_162                 3.9630717
## metab_163                14.4399347
## metab_164                 7.1747718
## metab_165                 3.9310471
## metab_166                 3.8429719
## metab_167                 3.3589423
## metab_168                 3.1719266
## metab_169                 4.2720474
## metab_170                 4.2149550
## metab_171                 4.9419993
## metab_172                 3.7296383
## metab_173                 4.0956131
## metab_174                 4.6193564
## metab_175                 3.7661891
## metab_176                 6.2453467
## metab_177                14.2757880
par(mar = c(6, 14, 4, 4) + 0.1) 
varImpPlot(rf_model, cex = 0.6)

rf_predictions <- predict(rf_model, newdata = test_data)

rf_mse <- mean((rf_predictions - y_test)^2)
rmse_rf <- sqrt(rf_mse)
cat("Diet + Chemical + Metabolomic + Covariates Random Forest MSE:", rf_mse, "\n")
## Diet + Chemical + Metabolomic + Covariates Random Forest MSE: 1.00938
cat("Diet + Chemical + Metabolomic + Covariates Random Forest RMSE:", rmse_rf, "\n")
## Diet + Chemical + Metabolomic + Covariates Random Forest RMSE: 1.004679
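importance(rf_model) reports only IncNodePurity because the forest was grown with the default importance = FALSE; permutation importance (%IncMSE) would require refitting with importance = TRUE. The underlying idea can also be computed by hand. The sketch below defines a hypothetical helper, perm_importance (not a randomForest function): shuffle one predictor in the test set and measure how much the test MSE worsens.

```r
# Permutation importance by hand: the increase in test-set MSE when one
# predictor's values are shuffled, averaged over a few shuffles. Works with
# any model that supports predict(model, newdata = ...).
perm_importance <- function(model, data, response, var, n_rep = 5) {
  base_mse <- mean((data[[response]] - predict(model, newdata = data))^2)
  shuffled <- replicate(n_rep, {
    d <- data
    d[[var]] <- sample(d[[var]])        # break the variable's association
    mean((d[[response]] - predict(model, newdata = d))^2)
  })
  mean(shuffled) - base_mse             # larger = more important
}

# e.g. on the forest above, for the top IncNodePurity variable:
set.seed(101)
perm_importance(rf_model, test_data, "hs_zbmi_who", "hs_pcb170_cadj_Log2")
```

Permutation importance is generally less biased toward high-cardinality or highly correlated predictors than node-purity measures, which is worth checking given the many correlated metabolites here.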

GBM

set.seed(101)  # reproducible CV fold assignment
gbm_model <- gbm(hs_zbmi_who ~ ., data = train_data, 
                 distribution = "gaussian",
                 n.trees = 1000,
                 interaction.depth = 3,
                 n.minobsinnode = 10,
                 shrinkage = 0.01,
                 cv.folds = 5,
                 verbose = TRUE)
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.4854             nan     0.0100    0.0031
##      2        1.4805             nan     0.0100    0.0028
##      3        1.4757             nan     0.0100    0.0018
##      4        1.4713             nan     0.0100    0.0028
##      5        1.4668             nan     0.0100    0.0017
##      6        1.4626             nan     0.0100    0.0020
##      7        1.4581             nan     0.0100    0.0040
##      8        1.4534             nan     0.0100    0.0032
##      9        1.4483             nan     0.0100    0.0038
##     10        1.4446             nan     0.0100    0.0029
##     20        1.3979             nan     0.0100    0.0035
##     40        1.3186             nan     0.0100    0.0021
##     60        1.2523             nan     0.0100    0.0018
##     80        1.1946             nan     0.0100    0.0013
##    100        1.1448             nan     0.0100    0.0007
##    120        1.0983             nan     0.0100    0.0003
##    140        1.0562             nan     0.0100    0.0006
##    160        1.0160             nan     0.0100    0.0006
##    180        0.9785             nan     0.0100    0.0004
##    200        0.9443             nan     0.0100    0.0010
##    220        0.9120             nan     0.0100    0.0005
##    240        0.8802             nan     0.0100    0.0006
##    260        0.8527             nan     0.0100   -0.0002
##    280        0.8277             nan     0.0100    0.0009
##    300        0.8038             nan     0.0100    0.0001
##    320        0.7802             nan     0.0100    0.0003
##    340        0.7576             nan     0.0100    0.0004
##    360        0.7374             nan     0.0100    0.0004
##    380        0.7185             nan     0.0100    0.0003
##    400        0.7012             nan     0.0100   -0.0001
##    420        0.6842             nan     0.0100    0.0001
##    440        0.6683             nan     0.0100    0.0001
##    460        0.6529             nan     0.0100    0.0000
##    480        0.6372             nan     0.0100   -0.0002
##    500        0.6226             nan     0.0100    0.0000
##    520        0.6085             nan     0.0100    0.0002
##    540        0.5954             nan     0.0100    0.0003
##    560        0.5827             nan     0.0100   -0.0002
##    580        0.5703             nan     0.0100   -0.0003
##    600        0.5584             nan     0.0100   -0.0002
##    620        0.5477             nan     0.0100   -0.0001
##    640        0.5368             nan     0.0100    0.0001
##    660        0.5267             nan     0.0100   -0.0000
##    680        0.5159             nan     0.0100    0.0001
##    700        0.5062             nan     0.0100    0.0001
##    720        0.4966             nan     0.0100   -0.0000
##    740        0.4872             nan     0.0100   -0.0002
##    760        0.4786             nan     0.0100   -0.0003
##    780        0.4704             nan     0.0100   -0.0000
##    800        0.4618             nan     0.0100   -0.0001
##    820        0.4538             nan     0.0100   -0.0000
##    840        0.4458             nan     0.0100   -0.0001
##    860        0.4381             nan     0.0100   -0.0001
##    880        0.4307             nan     0.0100   -0.0001
##    900        0.4237             nan     0.0100   -0.0001
##    920        0.4164             nan     0.0100   -0.0002
##    940        0.4097             nan     0.0100   -0.0002
##    960        0.4032             nan     0.0100   -0.0001
##    980        0.3966             nan     0.0100   -0.0000
##   1000        0.3905             nan     0.0100   -0.0000
summary(gbm_model)
# finding the best number of trees based on cross-validation
best_trees <- gbm.perf(gbm_model, method = "cv")

predictions_gbm <- predict(gbm_model, test_data, n.trees = best_trees)
mse_gbm <- mean((y_test - predictions_gbm)^2)
rmse_gbm <- sqrt(mse_gbm)

cat("Diet + Chemical + Metabolomic + Covariates GBM MSE:", mse_gbm, "\n")
## Diet + Chemical + Metabolomic + Covariates GBM MSE: 0.8857186
cat("Diet + Chemical + Metabolomic + Covariates GBM RMSE:", rmse_gbm, "\n")
## Diet + Chemical + Metabolomic + Covariates GBM RMSE: 0.9411262

Adding the diet, chemical, and metabolomic variables to the covariates produces the largest gains in predictive performance, indicating that these predictors carry substantial information about BMI Z-score beyond age and demographics.

Among the tree-based ensembles, GBM performs best on the full feature set (test MSE 0.886), although the group lasso still attains the lowest test MSE overall (0.871).

Overall Comparison of MSE/RMSE

# Note: for the "Diet + Chemical + Metabolomic + Covariates" feature set, the
# Ridge and Elastic Net rows record the grMCP and grSCAD group-penalty fits.
results <- data.frame(
  Model = rep(c("Lasso", "Ridge", "Elastic Net", "Decision Tree", "Random Forest", "GBM"), each = 5),
  Data_Set = rep(c("Covariates", "Diet + Covariates", "Chemicals + Covariates", "Diet + Chemical + Covariates", "Diet + Chemical + Metabolomic + Covariates"), 6),
  MSE = c(1.327, 1.297, 1.083, 1.087, 0.871,
          1.327, 1.288, 1.072, 1.086, 0.888,
          1.327, 1.295, 1.082, 1.089, 0.891,
          1.319, 1.319, 1.210, 1.187, 1.273,
          1.342, 1.299, 1.062, 1.050, 1.009,
          1.320, 1.295, 1.151, 1.150, 0.889),
  RMSE = c(1.152, 1.139, 1.041, 1.042, 0.933,
           1.152, 1.135, 1.036, 1.042, 0.942,
           1.152, 1.138, 1.040, 1.043, 0.944,
           1.149, 1.148, 1.100, 1.090, 1.128,
           1.159, 1.140, 1.031, 1.025, 1.005,
           1.149, 1.138, 1.073, 1.072, 0.943)
)
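To quantify the gains just described, the table can be summarized as the percent drop in test MSE relative to the covariates-only baseline for each model (base R, using the results frame defined above):

```r
# Percent drop in test MSE from the covariates-only baseline, per model.
baseline <- subset(results, Data_Set == "Covariates",
                   select = c("Model", "MSE"))
full_set <- subset(results,
                   Data_Set == "Diet + Chemical + Metabolomic + Covariates",
                   select = c("Model", "MSE"))
improvement <- merge(baseline, full_set, by = "Model",
                     suffixes = c("_base", "_full"))
improvement$pct_mse_drop <- with(improvement,
                                 round(100 * (MSE_base - MSE_full) / MSE_base, 1))
improvement[order(-improvement$pct_mse_drop), ]
```

With the numbers above, the penalized regressions improve on the baseline by roughly 33-34%, GBM by about 33%, the random forest by about 25%, and the pruned tree by under 4%.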

mse_plot <- ggplot(results, aes(x = Data_Set, y = MSE, fill = Model)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Model MSE Comparison", x = "Data Set", y = "Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

rmse_plot <- ggplot(results, aes(x = Data_Set, y = RMSE, fill = Model)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(mse_plot)

print(rmse_plot)

# filter results for Lasso model
lasso_results <- subset(results, Model == "Lasso")

mse_lasso_plot <- ggplot(lasso_results, aes(x = Data_Set, y = MSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Lasso Model MSE Comparison", x = "Data Set", y = "Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

rmse_lasso_plot <- ggplot(lasso_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Lasso Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(mse_lasso_plot)

print(rmse_lasso_plot)

# filter results for ridge model
ridge_results <- subset(results, Model == "Ridge")

mse_ridge_plot <- ggplot(ridge_results, aes(x = Data_Set, y = MSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Ridge Model MSE Comparison", x = "Data Set", y = "Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

rmse_ridge_plot <- ggplot(ridge_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Ridge Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(mse_ridge_plot)

print(rmse_ridge_plot)

# filter results for elastic net model
enet_results <- subset(results, Model == "Elastic Net")

mse_enet_plot <- ggplot(enet_results, aes(x = Data_Set, y = MSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Elastic Net Model MSE Comparison", x = "Data Set", y = "Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

rmse_enet_plot <- ggplot(enet_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Elastic Net Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(mse_enet_plot)

print(rmse_enet_plot)

# filter results for decision tree model
tree_results <- subset(results, Model == "Decision Tree")

mse_tree_plot <- ggplot(tree_results, aes(x = Data_Set, y = MSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Decision Tree Model MSE Comparison", x = "Data Set", y = "Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

rmse_tree_plot <- ggplot(tree_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Decision Tree Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(mse_tree_plot)

print(rmse_tree_plot)

# filter results for random forest model
rf_results <- subset(results, Model == "Random Forest")

mse_rf_plot <- ggplot(rf_results, aes(x = Data_Set, y = MSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Random Forest Model MSE Comparison", x = "Data Set", y = "Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

rmse_rf_plot <- ggplot(rf_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Random Forest Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(mse_rf_plot)

print(rmse_rf_plot)

# filter results for GBM model
gbm_results <- subset(results, Model == "GBM")

mse_gbm_plot <- ggplot(gbm_results, aes(x = Data_Set, y = MSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "GBM Model MSE Comparison", x = "Data Set", y = "Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

rmse_gbm_plot <- ggplot(gbm_results, aes(x = Data_Set, y = RMSE, fill = Data_Set)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "GBM Model RMSE Comparison", x = "Data Set", y = "Root Mean Squared Error") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(mse_gbm_plot)

print(rmse_gbm_plot)